Friday, July 4, 2008

Études en OCaml: Keeping your functions clean, Pt 2

Long, non-atomic functions are the bane of functional languages

For reasons I don't completely understand (but will try to discuss anyway), functional languages seem to raise the bar for code quality. In the interest of making life more exciting, functional languages also pile on the rewards when you do it right, and they'll conversely pie you in the face when you get it wrong.

You can accomplish a lot in a few lines of OCaml code, especially if what you're doing is well-suited to the functional model, and this creates special challenges for folks coming from C/C++. The tuple and variant matching features of the language make it fun and easy to violate separation of layers. C programmers will especially be delighted that their functions look so clean and short (only 20 lines!), and yet accomplish so much!

Alas, the actual usage of the function will inevitably be ugly, and you won't find that out until much later, when you have to refactor that 20-line function down to one or two wrapper functions and two or three 5-10 line functions. The problem with long functions is that they're difficult to reuse, especially with the bread-and-butter fold_left and map functions. I hate to use buzzwords, but I think it's accurate to say that your functions need to be *agile*.
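As a rough sketch of what that refactoring tends to look like, here is one chunky job split into small functions that each plug straight into List.map, List.filter and List.fold_left. The "name:score" line format and every function name below are invented for illustration:

```ocaml
(* Hypothetical input format: "name:score", e.g. "ada:91". *)

(* Parse one line into a (name, score) pair. *)
let parse_score line =
  match String.split_on_char ':' line with
  | [name; score] -> (name, int_of_string score)
  | _ -> invalid_arg ("bad line: " ^ line)

(* Decide whether one entry passes. *)
let is_passing (_, score) = score >= 50

(* Sum the scores of a list of entries. *)
let total_score entries =
  List.fold_left (fun acc (_, score) -> acc + score) 0 entries

(* The wrapper: no logic of its own, just plumbing. *)
let passing_total lines =
  lines |> List.map parse_score |> List.filter is_passing |> total_score
```

Each piece is useful on its own: `List.map parse_score lines` gives you the parsed entries without dragging the filtering or the summing along.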

The very idea of this starkly contrasts with how we write code in C/C++. Those primitive stone age tools limit how short one can practically write a function; the lack of built-in support for tuples and algebraic datatypes means that the programming overhead of passing return values (via pointer arguments and structs) is fairly high.

As such, abusing the language is normal and expected for OCaml beginners, especially those coming from the imperative domain, so don't fret if you find yourself rewriting functions so that they're shorter. It is happening because you are learning.


Tools for writing clean functions

You can't truly fix a problem until you know why it happens, yes? If it's true that our predisposition for the detail-oriented is what keeps us from writing clean functions, then perhaps the fix is one of the oldest in the book: Think before you hack. If you're not scoffing, then here is one idea for abetting this. If you're not exactly sure what you're doing, write a comment-- one phrase --to describe what you want to do before you start writing your code. If you need another phrase, then you need another function.
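Here's a small sketch of the one-phrase-per-function habit, using a made-up word-counting task (all names are hypothetical). Each comment was written first, and each time a second phrase was needed, it became a second function:

```ocaml
(* Split a line into its non-empty words. *)
let words line =
  String.split_on_char ' ' line |> List.filter (fun w -> w <> "")

(* Count the words in one line. *)
let count_words line = List.length (words line)

(* Total the word counts of all lines. *)
let total_words lines =
  List.fold_left (fun acc line -> acc + count_words line) 0 lines
```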

I've collected a number of thoughts-- more concrete ones --that also helped me when I was learning OCaml, and I hope you find them useful as well.


I have a hunch that clean code is not unlike a fractal...


OCaml-specific rules

- The broad 25-line rule

I pulled 25 out of my ass actually, but that's not really the point. The point of this exercise is to give beginners an idea of what OCaml code should look like. Readable OCaml code is typically composed of 5-20 line functions, or larger functions that are composed of subfunctions that are 3-13 lines long (assuming of course that you have spaces in between and not all 13 lines are actually code). Dispatches will be longer, but should not contain any logic of their own.

There will be many exceptions to this rule, particularly with purely imperative code like numerical computations, gui dispatchers and sysadmin-like tasks. OCaml has attracted a lot of numerical folks to its side and those jerks often like to remind me that none of my rules for keeping my code clean apply to them. It's important to identify purely imperative tasks like numerics, because they typically result in long functions no matter what you do. Once you have them identified, you can confine them into their own functions so that they won't muddle the rest of your code.
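One way to quarantine an imperative routine, sketched with a hypothetical `dot` function: the loop and the mutable accumulator live inside one function, and the rest of the code only ever sees an ordinary function from arrays to a float.

```ocaml
(* Imperative core: a dot product with an explicit loop and a mutable
   accumulator. Raises Invalid_argument on mismatched lengths. *)
let dot (a : float array) (b : float array) : float =
  if Array.length a <> Array.length b then invalid_arg "dot: length mismatch";
  let acc = ref 0.0 in
  for i = 0 to Array.length a - 1 do
    acc := !acc +. a.(i) *. b.(i)
  done;
  !acc

(* Functional callers never see the loop. *)
let norms vectors = List.map (fun v -> sqrt (dot v v)) vectors
```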


- Follow the types that you declare as layers of separation

The same variant types that grant OCaml its incredible expressiveness are also useful guidelines for determining how you should write your functions. Fortunately for us, it's much easier to think in terms of what you want to hold in a type than it is to figure out where a function should begin and where it should end. In a nutshell, a function may match directly against a particular type, but it shouldn't descend into the members of that type; hand those to a function written for the member's own type.



Good:

type entity =
  metadata *
  geometric_primitive

...

let render_geo_prim gp =
  match gp with
    TRIANGLE (a,b,c) -> ...

let render_entity (md,gp) =
  render_geo_prim gp


Bad:

type entity =
  metadata *
  geometric_primitive

...

let render_entity (md,gp) =
  match gp with
    TRIANGLE (a,b,c) -> ...
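
To make the contrast concrete, here is the Good layout filled out so that it compiles. The definitions of `metadata` and `geometric_primitive`, the extra `SPHERE` case, and the strings returned in place of real drawing are all placeholders of mine, not the original program:

```ocaml
(* Placeholder types; the real ones are elided in the example above. *)
type point = float * float * float
type metadata = { name : string }
type geometric_primitive =
  | TRIANGLE of point * point * point
  | SPHERE of point * float
type entity = metadata * geometric_primitive

(* Knows about geometric_primitive, and nothing else. *)
let render_geo_prim gp =
  match gp with
  | TRIANGLE (_, _, _) -> "triangle"
  | SPHERE (_, radius) -> Printf.sprintf "sphere r=%.1f" radius

(* Knows about entity, and hands the primitive down a layer. *)
let render_entity ((md, gp) : entity) =
  md.name ^ ": " ^ render_geo_prim gp

(* Because render_geo_prim stands alone, it maps over bare primitives too. *)
let render_all_prims prims = List.map render_geo_prim prims
```

In the Bad layout there is no `render_geo_prim` to hand to `List.map`, so a list of bare primitives would have to be wrapped in dummy metadata just to get rendered.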



I've tried applying this methodology at the module level. In a general sense, one could try to avoid descending into types that aren't declared in that module. I'm not sure if this works. Consider the following example. You have a graphics program. In this graphics program you have a type definition for the entities you want to draw, e.g. triangles, spheres, lines, meshes, etc. For the renderer you want the program to support a GL mode and a realtime-raytraced mode. Putting all of the code into the same file with the type definition would be a gross violation of layers, and will likely result in a single source file that'll be well over 100kb in size.

There are some types that you will want to expose the internals of. I term these types interface-types. Then there are some types that you expect to grow in the future, in ways that will totally mess with the pattern matching when you change the type spec. You typically want to hide the representations of the latter types, or perhaps consider using records instead of tuples. You can do this by writing a custom .mli interface file for your source file, and then erasing the right-hand side of the type definition. An easy way to get a starting .mli is to redirect the output of `ocamlc -i` into a file and then edit it:

ocamlc -i blah.ml > blah.mli


blah.ml:
type stuff = wakka * float * int * string


blah.mli:
type stuff (* hide actual definition *)
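
The same hiding can be tried out in a single file by writing the signature inline. In this sketch the module name `Blah`, the functions `make` and `describe`, the field names, and the use of `string` in place of the unspecified `wakka` type are all my assumptions. Outside the module, `stuff` is abstract, so the representation has already been switched from a tuple to a record without any caller being able to tell:

```ocaml
module Blah : sig
  type stuff                      (* abstract: representation hidden *)
  val make : string -> float -> int -> string -> stuff
  val describe : stuff -> string
end = struct
  (* A record instead of a tuple: adding a field later will not break
     code that accesses the existing fields by name. *)
  type stuff = { tag : string; weight : float; count : int; label : string }

  let make tag weight count label = { tag; weight; count; label }

  let describe s =
    Printf.sprintf "%s/%s x%d (%.1f)" s.tag s.label s.count s.weight
end

(* Callers can build and use a value, but cannot take it apart. *)
let example = Blah.describe (Blah.make "w" 1.5 3 "crate")
```

Trying to pattern match on a `Blah.stuff` from outside the module is a type error, which is exactly the point.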






Stuff that applies to OCaml and programming in general

- Resist any impulse to inline your code

I've heard many experienced and smart programmers tell me that I should inline a function because it's only got one caller. DO NOT DO THIS. Unless you are writing numerical computations or GL rendering routines, you *will* want the shorter function to List.map or Something.iter at some later point.
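
A sketch of why the small function earns its keep (the "key=value" format and all names are invented for the example). `parse_pair` starts life with a single caller, `lookup_port`, and a well-meaning reviewer might fold it in. Later it turns out to be exactly what List.filter_map wants:

```ocaml
(* Parse "key=value" into Some (key, value), or None if malformed. *)
let parse_pair line =
  match String.index_opt line '=' with
  | None -> None
  | Some i ->
    let key = String.sub line 0 i in
    let value = String.sub line (i + 1) (String.length line - i - 1) in
    Some (key, value)

(* The original, single caller. *)
let lookup_port line =
  match parse_pair line with
  | Some ("port", v) -> int_of_string_opt v
  | _ -> None

(* The later caller nobody planned for. *)
let parse_config lines = List.filter_map parse_pair lines
```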

The jury is undecided on this in the realm of general programming. Although there are many good programmers who've told me that it's silly to split functions up, the best C/C++ programmers I know will typically do so. However, I care little for arguing how things are done with ancient fossils like C, so I leave the general applicability of this rule up to the reader.


- No more than 4 arguments for any given function

Why four? Cognitive psychologists have known for some time that people have a short-term memory buffer of size 7, give or take 2. I'm sure many programmers will have the "max" of 9 (more if they've got Asperger's or autism which is perfectly possible), but you don't want your code to be only readable to the freaks (like yourselves) with photographic memory. Pushing the limits of your coworkers' memory isn't nice. Kindly leave them a free register.

"But I can't do with fewer than 5 independent arguments...", you say! If you *need* more than 4 independent, unrelated arguments for a function, it's possible that you need to write whatever it is you are doing in OO. If you need more than 4 arguments and they are not independent/unrelated, then perhaps you should be passing those values around as a tuple or as a record. I've found it helpful to try to understand why the OCaml designers implemented records, when there are already tuples in the language. Functors are also sometimes useful for making code readable, and for keeping the argument count down.

It's not a bad idea to apply this analysis elsewhere. In OCaml, 1 line of code will often express 1 routine, as opposed to C, which often requires 10-20 lines to do something like traverse a list of any sort (even using library functions). Thus it's practical to have many functions that are only 3-7 lines long in OCaml.

Note that you'll have to amend this rule for older languages like C, where we commonly pass around 2-3 arguments that we barely need, because the language does not support tuples or OOP.
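
For the case of more than four related arguments, here is a sketch of the record option: a hypothetical window-describing function whose six loose arguments collapse into two bundles. The types `rect` and `style` and the function `describe_window` are illustrative only:

```ocaml
(* Before: describe_window x y width height title border
   Six positional arguments, five of them interchangeable ints. *)

type rect = { x : int; y : int; width : int; height : int }
type style = { title : string; border : int }

(* After: two arguments, each a named bundle of related values. *)
let describe_window (r : rect) (s : style) =
  Printf.sprintf "%s at (%d,%d) size %dx%d border %d"
    s.title r.x r.y r.width r.height s.border

(* Records also make "same as before, except..." cheap to express. *)
let default_style = { title = "untitled"; border = 1 }
let thick = { default_style with border = 4 }
```

This is also one answer to why records exist alongside tuples: the tuple `(0, 0, 80, 24)` gives no hint which int is which, while the record labels do.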


- No getters for the OO code

People with C++ and Java experience will recoil in horror at the idea that you can't make a class variable public, or even protected.

Having written OO code in several languages for some time now, I'm convinced that C++ and Java need to expose their class implementations and member variables because they lack functional expressiveness (e.g. tuples, abstract primitive datatypes, first class functions). In fact, one of the greatest epiphanies I've had came after I read that you generally should not have any getters at all. Try to do everything locally, without exposing any variables.

Ironically, I read the no-getters rule in a Python article somewhere, which itself sourced the tactic to an unnamed Java article! The idea applies to OCaml all the same. Not using any getters lives up to the spirit of what the OCaml OO restrictions are trying to accomplish: proper separation of layers in your OO code.

There will be corner-cases where the no-getters rule doesn't apply, but it's surprising how far you can get with this practice, and how much it improves your code. It also seems to be possible to shove all the variables that need getters into a separate class, usually called something like "SettingsForBlah".
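
A minimal sketch of the no-getters style in OCaml's object system, with invented class names (OCaml class names are lowercase, so "SettingsForBlah" becomes `settings_for_account` here). `account` never hands out its balance; callers ask it questions or tell it to do things. The one value that genuinely needs reading from outside lives in a separate settings class:

```ocaml
(* Settings that do need to be read from outside go in their own class. *)
class settings_for_account (overdraft_limit : int) = object
  method overdraft_limit = overdraft_limit
end

(* No getter for balance: every question about it is answered locally. *)
class account (settings : settings_for_account) = object (self)
  val mutable balance = 0

  method deposit amount = balance <- balance + amount

  method can_afford price = balance + settings#overdraft_limit >= price

  (* Returns true if the withdrawal went through. *)
  method withdraw amount =
    if self#can_afford amount
    then (balance <- balance - amount; true)
    else false
end
```

A caller that wants to know whether a purchase is possible writes `acct#can_afford price` rather than fetching the balance and doing the arithmetic itself, so the overdraft rule stays in one place.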