OCaml Planet

December 22, 2014

Shayne Fletcher

Compiling regular expressions (II)

This article is a continuation of the earlier post, "Compiling regular expressions (I)".

Automata are modeled as 'state' records with two fields. The pos field contains the set of positions that are valid for recognition in the given state. Transitions are modeled as lists of pairs of symbols and states. In this way a state may contain transitions that reference itself.

type state = {
pos : Int_set.t;
mutable trans : (char * state) list ;

We will require a function that for each input symbol $a$ and a given set of positions $s$, computes the list of pairs $(a, s')$ where $s'$ is the subset of $s$ that correspond to $a$.

let (partition : char option array -> Int_set.t
-> (char option * Int_set.t) list) =
fun chars s ->
let f acc c =
match c with
| Some _ ->
if List.mem_assoc c acc then acc
let f i acc =
if chars.(i)
c then acc else Int_set.add i acc in
(c, Int_set.fold f s (Int_set.empty)) :: acc
| None ->
if List.mem_assoc c acc then acc else (c, Int_set.empty) :: acc in
List.rev (Array.fold_left f [] chars)
This function makes a list from a set of ints.

let list_of_int_set : Int_set.t -> Int_set.elt list =
fun s -> List.rev (Int_set.fold (fun e acc -> e :: acc) s [])
This function, accessible given a state, computes the list of sets that accessible from that state.

let (accessible : state -> Int_set.t array -> char option array
-> (char * Int_set.t) list) =
fun s follow chars ->
let part = partition chars s.pos in
let f p rest =
match p with
| (Some c, l) ->
(List.map (Array.get follow) (list_of_int_set l))
) :: rest
| _ -> rest in
List.fold_right f part []
find_state takes a set $s$ and two lists of states (marked and unmarked). It searches for a state which has a pos field equal to $s$ and returns this state or it fails.

let (find_state : Int_set.t -> state list -> state list -> state) =
fun s l m ->
let test e = e.pos = s in
List.find test l
| Not_found -> List.find test m

The algorithm to compute the automata works like this. Two lists are maintained, marked and unmarked states. The algorithm is initialized such that the only state is unmarked with a pos field containing first_pos $n_{0}$ where $n_{0}$ is the root of the syntax tree; the list of transitions is empty.

For an unmarked state $st$, the algorithm does these things:

  • Calculate a set of numbers accessible from $st$. That is, a set of pairs $(c, s)$, where $c$ is a character and $s$ a set of positions. A position $j$ is accessible from $st$ by $c$ if there is an $i$ in st.pos such that $j$ is in follow $i$ and $i$ numbers the character $c$.
  • For each of the pairs $(c, s)$
    • If there exists a state st' (whether marked or unmarked) such that $s = $st'.pos, it adds $(c, st')$ to the transitions of $st$;
    • Otherwise, a new state $st'$ without transitions is created, added to the transitions of $st$, and $st'$ is added to the list of unmarked states.
  • It marks $st$.
The algorithm terminates only when there are no remaining unmarked states. The result is an array of states obtained from the list of marked states. The terminal states are all those containing the number associated with Accept. Here then is the algorithm in code.

let rec (compute_states : state list -> state list -> Int_set.t array
-> char option array -> state array) =
fun marked unmarked follow chars ->
match unmarked with
| [] -> Array.of_list marked
| st :: umsts ->
let access = accessible st follow chars in
let marked1 = st :: marked in
let f (c, s) umsts =
if Int_set.is_empty s then
umsts (*Suppress empty sets*)
st.trans <- (c, find_state s marked1 umsts) ::st.trans ;
| Not_found ->
let state1 = {pos = s; trans = []} in
st.trans <- (c, state1) :: st.trans;
state1 :: umsts in
let unmarked1 = List.fold_right f access umsts in
compute_states marked1 unmarked1 follow chars

We are just about ready to write the function to compute the automaton. It is fundamentally a call to compute_states but does one more thing. That is, it searches the resulting array for the index of the initial state and puts the index in the first slot of the array. To do this it uses the utility function array_indexq which performs the search for the index using physical equality. This is because the usual test using structural equality will not terminate on structures that loop.

let (array_indexq : 'a array -> 'a -> int) =
fun arr e ->
let rec loop i =
if i = Array.length arr then
raise (Not_found)
else if Array.get arr i == e then i
else loop (i + 1) in
loop 0
So, here it is, dfa_of, the function to compute the automaton.

let (dfa_of : augmented_regexp * Int_set.t array * char option array
-> state array) =
fun (e, follow, chars) ->
let init_state = {pos = first_pos e; trans = []} in
let dfa = compute_states [] [init_state] follow chars in
(*Installing initial state at index 0*)
let idx_start = array_indexq dfa init_state in
dfa.(idx_start) <- dfa.(0);
dfa.(0) <- init_state;

We are now on the home stretch. All that remains is to write a function to interpret the automaton. To do this, we'll make use of a mini-combinator library of recognizers. I'll not provide the OCaml code for that today - you could reverse engineer from my earlier 'Recognizers' blog-post or, consult [1].

let (interpret_dfa : state array -> int -> char Recognizer.recognizer) =
fun dfa accept ->
let num_states = Array.length dfa in
let fvect = Array.make (num_states) (fun _ -> failwith "no value") in
for i = 0 to num_states - 1 do
let trans = dfa.(i).trans in
let f (c, st) =
let pc = Recognizer.recognizer_of_char c in
let j = array_indexq dfa st in
Recognizer.compose_and pc (fun l -> fvect.(j) l) in
let parsers = List.map f trans in
if Int_set.mem accept (dfa.(i).pos) then
fvect.(i) <- compose_or_list
(Recognizer.end_of_input) parsers
else match parsers with
| [] -> failwith "Impossible"
| p :: ps -> fvect.(i) <- Recognizer.compose_or_list p ps
We wrap up with a couple of high level convenience functions : compile produces a recognizer from a string representation of a regular expression and match takes a recognizer (that is, a compiled regular expression) and a string and uses the recognizer to categorize the given string as admissible or not (where explode is a simple function that transforms a string into a char list - recognizers operate on lists).

let compile xpr =
let ((e, follow, chars) as ast) = regexp_follow xpr in
let dfa = dfa_of ast in
let parser = interpret_dfa dfa (Array.length chars - 1) in
fun s -> parser (explode s)

let re_match xpr s =
let result = xpr s in
match result with
| Recognizer.Remains [] -> true
| _ -> false

Here's a simple test driver that shows how these functions can be used.

let test xpr s =
match re_match xpr s with
| true -> Printf.printf "\"%s\" : success\n" s
| false -> Printf.printf "\"%s\" : fail\n" s

let _ =
let xpr = compile "(a|b)*abb" in
Printf.printf "Pattern: \"%s\"\n" "(a|b)*abb" ;
test xpr "abb" ;
test xpr "aabb" ;
test xpr "baabb" ;
test xpr "bbbbbbbbbbbbbaabb" ;
test xpr "aaaaaaabbbaabbbaabbabaabb" ;
test xpr "baab" ;
test xpr "aa" ;
test xpr "ab" ;
test xpr "bb" ;
test xpr "" ;
test xpr "ccabb" ;
| Failure msg -> print_endline msg

So that's it for this series of posts on building recognizers for regular expressions. Hope you enjoyed it!

[1] "The Functional Approach to Programming" - Cousineau & Mauny
[2] "Compilers Principles, Techniques & Tools" - Aho et. al.

by Shayne Fletcher (noreply@blogger.com) at December 22, 2014 11:47 PM

Compiling regular expressions (I)

This post picks up from here which was concerned with parsing - obtaining representations of regular expressions as abstract syntax trees. The ultimate goal is, given a string representation of a regular expression $e$ , produce a 'recognizer' for the expression (referred to as compiling a regular expression). That is, a function string -> bool that can be used to categorize strings as either belonging to the language $\mathcal{L_{e}}$ or not.

Having produced an abstract syntax tree for a regular expression $e$, the first step in compiling the expression is to compute an abstract syntax tree of the corresponding augmented regular expression $(e)\#$. This augmented regular expression is the original expression $e$ concatenated with a unique end-marker $\#$. For the given expression $e$, the accepting state for $e$ is given a transition on $\#$. This is a device that allows us to "forget" about accepting states as the computation of a recognizer proceeds; when the construction is complete, any state with a transition on $\#$ must be an accepting state.

Leaves in the abstract syntax tree of the augmented regular expression $(e)\#$ are labeled by $\epsilon$ or a symbol from from $\mathcal{A}$. For those non-$\epsilon$ leaves we attach a unique integer. Accordingly, we will need functions to generate unique integers (positions) that we will employ as we transform the AST of $e$ into the AST of the augmented expression $(e)\#$ leading to our first code example.

let reset_label, generate_label =
let r = ref (-1) in
((fun () -> r := (-1)), (fun () -> r := !r + 1; !r))

As we construct the syntax tree of the $(e)\#$ we compute four functions : null_pos, first_pos, last_pos and following:

  1. null_pos is $true$ for a syntax-tree node $n$ if and only if the sub-expression represented by $n$ has $\epsilon$ in its language. That is, $true$ if the regular sub-expression recognizes the empty string and $false$ otherwise;
  2. first_pos is the set of positions in the sub-tree rooted at $n$ that correspond to the first symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can begin a string recognized by the regular sub-expression;
  3. last_pos is the set of positions in the sub-tree rooted at the syntax-tree node $n$ that corresponds to the last symbol of at least one string in the language of the sub-expression rooted at $n$. That is, the set of symbols that can terminate a string recognized by the regular sub-expression;
  4. following, for a position $p$ is the set of positions $q$ in the entire syntax-tree such that there is some string $x = a_{1}a_{2} \cdots a_{n}$ in $\mathcal{L_{(e)\#}}$ such that for some $i$, there is some way to explain the membership of $x$ in $\mathcal{L_{(e)\#}}$ by matching $a_{i}$ to $p$ and $a_{i + 1}$ to a position in $q$.
Of these, following is the last to compute as it relies upon the values of first_pos and last_pos. If the definition is confusing for now, don't worry about it. The rules for computing following will come later and will be obvious at that point. We'll focus for now on null_pos, first_pos and last_pos.

The results of first_pos, last_pos and follow are sets of integers. Accordingly, we are going to need a type to represent these.

module Int_set : Set.S with type elt = int = Set.Make (
let compare = Pervasives.compare
type t = int
With this we can present the type of ASTs for augmented regular expressions.

type augmented_regexp =
| Epsilon
| Character of char * int
| Sequence of augmented_regexp * augmented_regexp * pos
| Alternative of augmented_regexp * augmented_regexp * pos
| Repetition of augmented_regexp * pos
| Accept of int
and pos = {

For a given node $n$, the values of its pos record depend only on the sub-expressions of that node. Assuming constructed augmented regular expression syntax trees, we can write null_pos, first_pos and last_pos like this.

let (null_pos : augmented_regexp -> bool) =
fun x ->
match x with
| Epsilon -> true
| Character (_, i) -> false
| Sequence (_, _, p) -> p.null
| Alternative (_, _, p) -> p.null
| Repetition (_, p) -> p.null
| Accept _ -> false

let (first_pos : augmented_regexp -> Int_set.t) =
fun x ->
match x with
| Epsilon -> Int_set.empty
| Character (_, i) -> Int_set.add i (Int_set.empty)
| Alternative (_, _, p) -> p.first
| Repetition (_, p) -> p.first
| Sequence (_, _, p) -> p.first
| Accept i -> Int_set.add i (Int_set.empty)

let (last_pos : augmented_regexp -> Int_set.t) =
fun x ->
match x with
| Epsilon -> Int_set.empty
| Character (_, i) -> Int_set.add i (Int_set.empty)
| Alternative (_, _, p) -> p.last
| Repetition (_, p) -> p.last
| Sequence (_, _, p) -> p.last
| Accept i -> Int_set.add i (Int_set.empty)

Our strategy in building the syntax-tree of $(e)\#$ from the syntax tree of $e$ will be to visit each node of $e$ and invoke a function to construct the corresponding node of $(e)\#$ inductively. These functions will include the generation of unique integers for the non-$\epsilon$ leaves and encode the rules for building the pos records:

  • null
    • Sequence $(e_{1}, e_{2})$ : null_pos $e_{1}$ and null_pos $e_{2}$
    • Alternative $(e_{1}, e_{2})$ : null_pos $e_{1}$ or null_pos $e_{2}$
    • Repetition : $true$
  • first
    • Alternative $(e_{1}, e_{2})$ : first_pos $e_{1} \cup$ first_pos $e_{2}$
    • Sequence $(e_{1}, e_{2})$ : if null_pos $e_{1}$ then first_pos $e_{1} \cup$ first_pos $e_{2}$ else first_pos $e_{1}$
    • Repetition $e$ : first_pos $e$
  • last
    • Alternative $(e_{1}, e_{2})$ : last_pos $e_{1} \cup$ last_pos $e_{2}$
    • Sequence $(e_{1}, e_{2})$ : if null_pos $e_{2}$ then last_pos $e_{1} \cup$ last_pos $e_{2}$ else last_pos $e_{2}$
    • Repetition $e$ : last_pos $e$

Here then are the augmented regular expression syntax-tree node constructor functions.

let (epsilon : unit -> augmented_regexp) =
fun () ->

and (character : char -> augmented_regexp) =
fun c ->
Character (c, generate_label ())

and (repetition : augmented_regexp -> augmented_regexp) =
fun e ->
Repetition (e, {null=true;first=first_pos e; last=last_pos e})

and (alternative : augmented_regexp -> augmented_regexp -> augmented_regexp) =
fun e1 e2 ->
Alternative (e1, e2,
{null=null_pos e1 || null_pos e2;
first=Int_set.union (first_pos e1)(first_pos e2);
last=Int_set.union (last_pos e1) (last_pos e2)})

and (sequence : augmented_regexp -> augmented_regexp -> augmented_regexp) =
fun e1 e2 ->
let b1 = null_pos e1
and b2 = null_pos e2 in
Sequence (e1, e2,
{null=b1 && b2;
if b1 then
Int_set.union (first_pos e1)(first_pos e2)
first_pos e1;
if b2 then
Int_set.union (last_pos e1) (last_pos e2)
last_pos e2})

let (accept : augmented_regexp -> augmented_regexp) =
fun e ->
sequence e (Accept (generate_label ()))

We are now in a position to write the function that transforms a syntax tree of the regular expression $e$ into the syntax tree of the augmented regular expression $(e)\#$.

let rec (augmented_regexp : Syntax.regular_expression -> augmented_regexp) =
fun x ->
match x with
| Syntax.Epsilon -> epsilon ()
| Syntax.Character i -> character (Char.chr i)
| Syntax.Sequence (x, y) ->
(*Be very careful here. Evaluation order matters!*)
let x' = (augmented_regexp x)
and y' = (augmented_regexp y) in
sequence x' y'
| Syntax.Alternative (x, y) ->
(*Be very careful here. Evaluation order matters!*)
let x' = (augmented_regexp x)
and y' = (augmented_regexp y) in
alternative x' y'
| Syntax.Repetition x -> repetition (augmented_regexp x)

We can wrap all of the above up in a convenience function parse_augmented_regexp which first parses a string to build the syntax tree of the regular expression it represents and then transforms the result into the syntax tree of the corresponding augmented regular expression.

let (parse_augmented_regexp : string-> augmented_regexp * int) =
fun s ->
let () = reset_label () in
let ast = regexp_of_string s in
let re1 = augmented_regexp ast in
let re2 = accept re1 in
let count = generate_label () in
(re2, count)
Notice that this function returns a pair of the syntax-tree and the number of positions it contains.

The next step in compiling a recognizer from the expression $(e)\#$ is to compute the follow function. To do this we "unite" the information encoded by the first_pos and last_pos functions. Put plainly, follow is a function that takes each symbol (position) in the regular expression to the (set of) symbols (positions) that can follow it. The information is stored in an array of length equal to the number of symbols appearing in the regular expression. There are only two ways a position in a regular expression can follow another:

  • If $n$ is a Sequence node with left child $c_{1}$ and right child $c_{2}$, then for every position $i$ in lastpos $c_{1}$, all positions in firstpos $c_{2}$ are in follow_pos $i$
  • If $n$ is a Repition and $i$ a position in lastpos $n$, then all positions in first_pos $n$ are in follow_pos $i$
In addition to computing follow the code below also stores the association between positions and characters of the regular expression. That information goes into an array. The elements of the array have type char option since the Accept symbol has a position but no character associated with it.

let (compute_follow : Int_set.t array -> char option array -> augmented_regexp -> unit) =
fun follow chars x ->
let rec compute x =
match x with
| Sequence (e1, e2, p) ->
compute e1; compute e2;
let first2 = first_pos e2 in
let f i =
follow.(i) <- Int_set.union first2 (follow.(i)) in
Int_set.iter f (last_pos e1)
| Repetition (e, p) ->
compute e;
let f i =
follow.(i) <- Int_set.union (p.first) (follow.(i)) in
Int_set.iter f (p.last)
| Alternative (e1, e2, p) -> compute e1; compute e2
| Epsilon -> ()
| Accept i -> chars.(i) <- None
| Character (c, i) -> chars.(i) <- Some c in
compute x

Now the computation of the augmented regular expression syntax-tree and all four of the associated functions together with the mapping from positions to symbols of $\mathcal{A}$ can be wrapped up in another "high-level" convenience function.

let (regexp_follow : string -> augmented_regexp * Int_set.t array * char option array) =
fun s ->
let re, n = parse_augmented_regexp s in
let follow = Array.make n (Int_set.empty) in
let chars = Array.make n None in
compute_follow follow chars re;
(re, follow, chars)

We're in good shape - but a hop, skip and a jump to computing a recognizer from a regular expression. We'll leave off here on this local maxima for today and finish off the job in the next post!

by Shayne Fletcher (noreply@blogger.com) at December 22, 2014 04:31 PM


LexiFi is hiring OCaml developers

LexiFi is hiring!

We are looking for full time software developers, with a passion for functional programming, ideally in OCaml, and for beautiful code.

Learn about how LexiFi uses OCaml: http://www.lexifi.com/product/technology/ocaml

No prior knowledge of finance is required. If you are interested, please send your resume to careers@lexifi.com.

read more

by Alain Frisch at December 22, 2014 02:44 PM

December 19, 2014


OCaml 4.01 for iOS 8 Simulator

December 19, 2014

OCamlXARM compiles for an iOS device, but OCamlXSim compiles for an iOS Simulator. The same ocamloptrev script that compiles OCaml for iOS 8 can also get OCamlXSim to compile OCaml for the iOS 8 Simulator. The only thing that changes is the location of the compiler.

If you want to try out OCaml on the iOS 8 Simulator, here is an update to the script that compiles for either an iOS device or an iOS Simulator (ocamloptrev):

# ocamloptrev     ocamlopt for specified iOS revision
USAGE='ocamloptrev  -rev M.N  [ -sim ]  other-ocamlopt-options ...'


declare -a ARGS
while [ $# -gt 0 ] ; do
    case $1 in
        if [ $# -gt 1 ]; then
            shift 2
            echo "$USAGE" >&2
            exit 1
    *)  ARGS[${#ARGS[*]}]="$1"
        shift 1
if [ "$REV" = "" ]; then
    echo "$USAGE" >&2
    exit 1


case $SIM in
y)  PLT=$HIDEOUT/Platforms/iPhoneSimulator.platform 
n)  PLT=$HIDEOUT/Platforms/iPhoneOS.platform 

$OCAMLC -ccopt -isysroot -ccopt "$PLT$SDK" "${ARGS[@]}"

To compile for the iOS Simulator, specify -sim along with -rev M.N.

Let’s make a tiny OCaml program for testing:

$ Q="Do you know what it's like on the outside?\\n"
$ echo "Printf.printf \"$Q\"" > ny1941.ml

Here’s what happens if you compile with the current OCamlXSim on a system with the iOS 8.1 SDK:

$ /usr/local/ocamlxsim/bin/ocamlopt -o ny1941 ny1941.ml
clang: warning: no such sysroot directory: '/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator7.1.sdk'
ld: library not found for -lSystem
clang: error: linker command failed with exit code 1 (use -v to see invocation)
File "caml_startup", line 1:
Error: Error during linking

As you can see, it’s trying and failing to use the default iOS Simulator 7.1 SDK. Here’s how to use ocamloptrev (the above script):

$ ocamloptrev -sim -rev 8.1 -o ny1941 ny1941.ml
$ ls -l ny1941
-rwxr-xr-x  1 jeffsco  staff  303364 Dec 19 23:02 ny1941
$ file ny1941
ny1941: Mach-O executable i386

You can actually run an iOS simulator app from the OS X command line, though there are many things that don’t work properly.

$ ny1941
Do you know what it's like on the outside?

See iOS Simulator Vs. OS X for a description of some differences between the OS X and the iOS Simulator environments.

If you don’t specify -sim, the script compiles for an iOS device as before:

$ ocamloptrev -rev 8.1 -o ny1941 ny1941.ml -cclib -Wl,-no_pie
$ file ny1941
ny1941: Mach-O executable arm

When not working in the subbasement of my alma mater, I’m working in my cluttered underground workroom on several OCaml-on-iOS projects. Along with holiday joys and the delights of coding in node.js during the day, I’ll keep working through them as fast as I can.

I hope this script will be useful for folks who want to try OCaml on the iOS Simulator while I’m updating my humble patches to the latest versions of everything and keeping all the irons in the fire.

If you have any trouble (or success) with the script, or have any other comments, leave them below or email me at jeffsco@psellos.com.

Posted by: Jeffrey

<style type="text/css"> .flowaroundimg { float: left; margin: 0em 1em 0em 0em; } .rightoffloat li { position: relative; left: 1em; } pre { white-space: pre-wrap; width: 96%; margin-bottom: 24px; overflow: hidden; padding: 3px 10px; -webkit-border-radius: 3px; background-color: #fed; border: 1px solid #dcb; } pre code { white-space: pre-wrap; border: none; padding: 0; background-color: transparent; -webkit-border-radius: 0; } td { padding: 0em 1em 0em 1em; } th { padding: 0em 1em 0em 1em; } </style>

December 19, 2014 07:00 PM


Become a BST Ninja - Genin Level


Binary Search Tree (BST) is one of the most classic data structures. The definition for its structure is shown as below:

  • It consists of Nodes and Leaves.
  • A Node has a child bst on the left side, a key (, a data), and a child bst on the right side. Note here a node's left or right child is not a node, instead, is indeed another binary search tree.
  • A Leaf has nothing but act only as a STOP sign.
type 'a bst_t =  
  | Leaf
  | Node of 'a bst_t * 'a * 'a bst_t (* Node (left, key, right) *)
(* a node may also carry a data associated with the key. 
   It is omitted here for neater demonstration *)

The important feature that makes BST unique is

  • for any node
    • all keys from its left child are smaller than its own key
    • all keys from its right child are bigger (assumming keys are distinct)

And here is an example of BST:


Instead of continuing to present the basics of BST, this post will now focus on how to attack BST related problems with the most powerful weapon: Recursion.

Recursion on BST

From Recursion Reloaded, we know that one way to model recursion is:

  1. Assume we already got a problem solver: solve.
  2. Don't think about what would happen in each iteration.
  3. Split the problem to smallers sizes (N = N1 + N2 + ...).
  4. Solve those smaller problems like solve N1, solve N2, ... Here is the tricky part: still, we do not know or care what are inside solve and how f would do the job, we just believe that it will return the correct results.
  5. Now we have those results for smaller problems, how to wire them up to return the result for N? This is the ultimate question we need to answer.
  6. Of course, we do not forget the STOP sign for some edge cases.
  7. Together with point 5 and 6, we get our solve.

A good thing coming from BST is that the split step has been done already, i.e., a BST problem can be always divided into left child, root, and right child.

Thus if we assume we already got solve, we just need to solve left and / or solve right, then do something with root, and finally wire all outcomes to obtain the final result. Again, we should of course never forget the STOP sign and in BST, usually it is the Leaf, i.e., we need to ask ourselves what if the BST is a Leaf.

The thinking flow is illustrated as the diagram below.


Before we start to look at some problems, note that in the diagram above or Recursion Reloaded, we seem to always solve both left and right, or say, all sub-problmes. It is actually not necessary. For BST, sometimes either left or right is enough. Let's have a look at this case first.

Either Left or Right

Our starter for this section is the simplest yet very essential operation: insert a key to a BST.


From the definition of BST, we know the basic rule is that if the new key is smaller than a root, then it belongs to the root's left child; otherwise, to the root's right child. Let's follow the modelling in the previous diagram to achieve this.

  1. We assume we already got insert key bst
  2. We know the problem can be split into left, root, and right.
  3. Because a new key can go either left or right, so we need to deal_with root first, i.e., compare the new key and the key of the root.
  4. Directly taken from the rule of BST, if the new key is smaller, then we need to solve left thus insert key left; otherwise insert key right.
  5. What if we get to a Leaf? It means we can finally place our new key there as a new Node and of course, at this moment both children of the new Node are Leaves.


Note that the BST type in OCaml we defined early on is pure functional, which means every time we need to update something, we have to create new. That's why in the diagram, even if we just insert x to left or right, we need to construct a new Node because we are updating the left child or the right one. The code is shown as below.

let rec insert x = function  
  | Leaf -> Node (Leaf, x, Leaf) (* Leaf is a STOP *)
  | Node (left, k, right) ->
    if x < k then Node (insert x left, k, right) (* if x is smaller, go left *)
    else Node (left, k, insert x right) (* otherwise, go right *)


member is to check whether a given key exists in the BST or not. It is very similar to insert. There are three differences:

  1. When dealing with root, we need to see whether the given key equals to the root's key or not. If yes, then we hit it.
  2. member is a readonly function, so we directly solve left or right without constructing new Node.
  3. If arriving at a Leaf, then we have nowhere to go any more and it means the given key is not in the BST.
let rec mem x = function  
  | Leaf -> false
  | Node (left, k, right) ->
    if x < k then mem x left
    else if x = k then true
    else mem x right

Both Left and Right

Usually when we need to retrieve some properties of a BST or possibly go through all nodes to get an answer, we have to solve both children. However, the modelling technique does not change.


Recall from Some properties of a tree, the height of a tree is the number of edges that the longest path (from the root to a leaf) has. From this definition, it seems easy to get the height. We simply try to find all possible paths from root and for each path we record its number of edges. The answer will be the max of them. It sounds straightforward, but if you really try to write the code in this way, I bet the code would be a bit messy. Honestly, I never wrote in this way and I will never do that.

Another way is to think recursively. First let analyse a little bit about the longest path matter.


As we can see from the above diagram, Root has two edges: one to Left and the other to Right. So whatever the longest path from Root might be, it must pass either Left or Right. If we somehow could obtain the longest path from the root of Left and the longest path from the root of Right, the longest path from Root should be the bigger one of the two paths, right?

Let's now assume we already got height and it will return the height of a BST. Then we can obtain h_left and h_right. The answer is what is the h (height of Root)? Note that the height implies the longest path already (that's the definition). So from the paragraph above, What we need to do is getting max h_left h_right. Since Root has an edge to either child, h = 1 + max h_left h_right.

Don't forget the STOP sign: the height of a Leaf is 0.

let rec height = function  
  | Leaf -> 0
  | Node (left, _, right) -> 1 + max (height left) (height right)

Simple, isn't it?

Keys at a certain depth

So far, it seems our hypothetic solve function takes only the sub-probem as parameter. In many cases this is not enough. Sometimes we need to supply more arguments to help solve. For example, in the problem of retriving all keys at a certain depth definitely needs current depth information.


Only with the help of current_depth, the Root can know whether it belongs to the final results.

  1. If current_depth = target_depth, then Root should be collected. Also we do not continue to solve Left or Right as we know their depths will never equal to target_deapth.
  2. Otherwise, we need to solve both Left and Right with argument of 1 + current_depth.
  3. Assume our solve is working. Then solve left (1+current_depth) would return a list of target nodes and so does solve right (1+current_depth). We simply then concatenate two target lists.
  4. STOP sign: Leaf is not even a Node, so the result will be empty list.

The code is like this:

let rec from_depth d cur_d = function  
  | Leaf -> []
  | Node (left, k, right) ->
    if cur_d = d then [k]
    else from_depth d (cur_d + 1) left @ from_depth d (cur_d + 1) right

let all_keys_at depth bst = from_depth d 0 bst  


From Ninja Wiki

A system of rank existed. A jōnin ("upper man") was the highest rank, representing the group and hiring out mercenaries. This is followed by the chūnin ("middle man"), assistants to the jōnin. At the bottom was the genin ("lower man"), field agents drawn from the lower class and assigned to carry out actual missions.


Some readers contacted me. They hope that maybe I can use more advanced knowledge or harder examples in my posts and the current ones might seem a little boring. I think I need to explain a bit here.

The general idea behind Many things about OCaml is not to write a cookbook for certain problems related to OCaml or be a place where quick solution is there and copy / paste sample code would do the trick. Instead, Many things means some important aspects in OCaml that might be overlooked, or some particular problems that can show the greatness of OCaml, or the elegant OCaml solutions for some typical data structures and algorithms, etc. As long as something are valuable and that value shows only in OCaml or Functional Programming, I would like to add them all in here one by one.

Fortunately or unfortunately, even though I have only limited experiences in OCaml, I found that the many is actually quite big. And due to this many, I had to make a plan to present them all in a progressive way. Topics can interleave with each other in terms of time order as we do not want to have the same food for a month. More importantly, however, all should go from simple / easy to advanced / hard. In order to present some advanced topic, we need to make sure we have a solid foundation. This is why, for example, I even produced a post for the properties of a tree although they are so basic. This is also why I reloaded recursion since recursion is everywhere in OCaml. They are a kind of preparations.

Moreover, I believe in fundamentals. Fundamentals are normally concise and contain the true essence. But sometimes they can be easily overlooked or ignored and we may need to experience a certain number of difficult problems afterwards, then start to look back and appreciate the fundamentals.

The reason of using simple examples is that it makes my life easier for demonstrations. I love visualisations and one graph can be better than thousands of words. For complicated problems and solutions, it is a bit more difficult to draw a nice and clean diagram to show the true idea behind. I will try to do so later on, but even if I could achieve it in this early stage, some readers might easily get lost or confused because of the unavoidable complication of the graph. As a result, the point of grasping fundamentals might be missed.

Anyway, please don't worry too much. Attractive problems in OCaml are always there. For example, in my plan, I will later start to present a number (maybe 15 ~ 17) of my beloved Functional Pearls in OCaml and if you are really chasing for some awesomeness, I hope they would satisfy you.

by Jackson Tale at December 19, 2014 12:28 PM

December 18, 2014

OCaml Platform

OPAM 1.2 and Travis CI

The new pinning feature of OPAM 1.2 enables new interesting workflows for your day-to-day development in OCaml projects. I will briefly describe one of them here: simplifying continuous testing with Travis CI and GitHub.

Creating an opam file

As explained in the previous post, adding an opam file at the root of your project now lets you pin development versions of your project directly. It's very easy to create a default template with OPAM 1.2:

$ opam pin add <my-project-name> . --edit
[... follow the instructions ...]

That command should create a fresh opam file; if not, you might need to fix the warnings in the file by re-running the command. Once the file is created, you can edit it directly and use opam lint to check that is is well-formed.

If you want to run tests, you can also mark test-only dependencies with the {test} constraint, and add a build-test field. For instance, if you use oasis and ounit, you can use something like:

build: [
  ["./configure" "--prefix=%{prefix}%" "--%{ounit:enable}%-tests"]
build-test: [make "test"]
depends: [
  "ounit" {test}

Without the build-test field, the continuous integration scripts will just test the compilation of your project for various OCaml compilers. OPAM doesn't run tests by default, but you can make it do so by using opam install -t or setting the OPAMBUILDTEST environment variable in your local setup.

Installing the Travis CI scripts

Travis CI is a free service that enables continuous testing on your GitHub projects. It uses Ubuntu containers and runs the tests for at most 50 minutes per test run.

To use Travis CI with your OCaml project, you can follow the instructions on https://github.com/ocaml/ocaml-travisci-skeleton. Basically, this involves:

  • adding .travis.yml at the root of your project. You can tweak this file to test your project with different versions of OCaml. By default, it will use the latest stable version (today: 4.02.1, but it will be updated for each new compiler release). For every OCaml version that you want to test (supported values for <VERSION> are 3.12, 4.00, 4.01 and 4.02) add the line:
  • signing in at TravisCI using your GitHub account and enabling the tests for your project (click on the + button on the left pane).

And that's it, your project now has continuous integration, using the OPAM 1.2 pinning feature and Travis CI scripts.

Testing Optional Dependencies

By default, the script will not try to install the optional dependencies specified in your opam file. To do so, you need to manually specify which combination of optional dependencies you want to tests using the DEPOPTS environment variable. For instance, to test cohttp first with lwt, then with async and finally with both lwt and async (but only on the 4.01 compiler) you should write:

   - OCAML_VERSION=latest DEPOPTS=async
   - OCAML_VERSION=4.01   DEPOPTS="lwt async"

As usual, your contributions and feedback on this new feature are gladly welcome.

by Thomas Gazagnaire at December 18, 2014 12:00 AM

December 14, 2014


OCaml App for iOS 8.1 (Sources)

December 14, 2014

I coded up a simple OCaml iOS app to run in iOS 8.1. Instructions for downloading, building, and running the app are here:

Portland: Which Way Is Up on iOS?

You can download the sources directly here:

Portland 2.0.3, OCaml app for iOS 8.1 (29 KB)

This is a revamped version of Portland, the first example OCaml iOS app I made a few years ago. For maximum clarity it doesn’t do anything particularly impressive. It really just shows how to code an iOS app in OCaml.

Here are some things I learned while revamping.

  • Remember to call caml_main() in your main program (see main.m). If you forget, you’ll get the “undefined atom table” error at link time. I wrote about this in Undefined caml_atom_table.

  • If you keep disembodied OCaml values in the Objective C world, remember to register them as global roots using caml_register_global_root. Otherwise you’ll experience chaos at the first GC. I wrote about this in OCaml, Objective C, Rule 4.

  • Automatic Reference Counting imposes some restrictions on what you can do in wrapper code. For the Portland example (and probably for many real-world apps) it’s enough to have a table of Objective C objects that are conceptually referenced from OCaml. That is, the table in the Objective C world references the objects as a proxy for references from the OCaml world. You can see the code for this in wrap.m. I hope to write more about this. Maybe you, reader, have some ideas for a better approach.

  • Modern day iOS apps are based on View Controllers rather than on Views. In particular, it’s usual to define a custom subclass of UIViewController for each piece of the interface. This is tricky for OCaml on iOS, as it’s not (currently) possible to define an OCaml subclass of an Objective C class. For Portland I’m using an Objective C subclass of UIViewController that delegates to an OCaml object. Here too, this is probably good enough for many real-world apps. I hope to write more about this also.

  • There are several cyclic dependencies among the classes of Cocoa Touch used in Portland. To represent them in OCaml I use a common set of definitions named ui.mli, where the cycles can be accommodated using class type a =and b = … . It seems to me this is a strength of OCaml’s structural typing for objects. That is, it’s possible to define class types independently of particular classes. In this way cycles can be represented without forward-reference loopholes. (However it’s possible that the number of cycles in a full interface to Cocoa Touch would become overwhelming.)

It’s dark, chilly, and wet here by Puget Sound; I’m going to retire now to my tent and my dreams. The next thing on my OCaml-on-iOS schedule is to update to the latest OCaml compiler. I’m getting serious polymathic help on this, as I hope you’ll hear about soon.

If you have any trouble (or success) with the Portland app, or have any other comments, leave them below or email me at jeffsco@psellos.com.

Posted by: Jeffrey

<style type="text/css"> .flowaroundimg { float: left; margin: 0em 1em 0em 0em; } .rightoffloat li { position: relative; left: 1em; } pre { white-space: pre-wrap; width: 96%; margin-bottom: 24px; overflow: hidden; padding: 3px 10px; -webkit-border-radius: 3px; background-color: #fed; border: 1px solid #dcb; } pre code { white-space: pre-wrap; border: none; padding: 0; background-color: transparent; -webkit-border-radius: 0; } td { padding: 0em 1em 0em 1em; } th { padding: 0em 1em 0em 1em; } </style>

December 14, 2014 07:00 PM

December 13, 2014

OCamlCore Forge News

OCaml EFL 1.12.0 released

Major changes: - Moved to version 1.12 of the EFL/Elementary - Rewriting of the build toolchain (although ocamlbuild still do the compilation part) - The package adapts to the EFL/Elementary version of the user: An archive for each version of the EFL is not necessary any more. For example, if version 1.11 is installed, the interfaces to 1.12 specific functions will not be built but the library obtained will still be usable. - An experimental interraction with Lwt (available in an example) - Compiling with OCaml 4.02 should not create any warning. More precisely, the mutability of strings is not used any more. However, it is still possible to use the old OCaml 3.12.

by Alexis Bernadet at December 13, 2014 11:00 PM