Friday, July 4, 2008

Études en OCaml: Keeping your functions clean, Pt 2

Long, non-atomic functions are the bane of functional languages

For reasons I don't completely understand (but will try to discuss anyway), functional languages seem to raise the bar for code quality. In the interest of making life more exciting, functional languages also pile on the rewards when you do it right, and they'll conversely pie you in the face when you get it wrong.

You can accomplish a lot in a few lines of OCaml code, especially if what you're doing is well-suited to the functional model, and this creates special challenges for folks coming from C/C++. The tuple and variant matching features of the language make it fun and easy to violate separation of layers. C programmers will especially be delighted that their functions look so clean and short (only 20 lines!), and yet accomplish so much!

Alas, the actual usage of the function will inevitably be ugly, and you won't find that out until much later, when you have to refactor that 20 line function down to one or two wrapper functions and two or three 5-10 line functions. The problem with long functions is that it's difficult to reuse them, especially with the bread and butter fold_left and map functions. I hate to use buzzwords but I think it's accurate to say that your functions need to be *agile*.
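
As a rough illustration of what that refactoring buys you, here is a minimal sketch. The `item` type, the function names, and the sample data are all made up for this example; the point is only that each small function plugs straight into `List.map`, `List.filter`, and `List.fold_left`.

```ocaml
(* Hypothetical example: the type, names, and data are invented. *)
type item = { name : string; price_cents : int; quantity : int }

(* One job: the cost of a single line item. *)
let line_total item = item.price_cents * item.quantity

(* One job: add up a list of integers. *)
let sum = List.fold_left ( + ) 0

(* The wrapper is now just a composition of the two small pieces. *)
let order_total items = sum (List.map line_total items)

(* The small pieces can be reused in ways nobody planned for. *)
let expensive_lines items =
  List.filter (fun item -> line_total item > 1000) items

let sample_order =
  [ { name = "pencil"; price_cents = 50; quantity = 4 };
    { name = "notebook"; price_cents = 300; quantity = 5 } ]

let () = Printf.printf "total: %d cents\n" (order_total sample_order)
```

Had `line_total` been buried inside a 20-line `order_total`, `expensive_lines` would have needed its own copy of the arithmetic.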

The very idea of this starkly contrasts with how we write code in C/C++. Those primitive stone age tools limit how short one can practically write a function; the lack of built-in support for tuples and abstract datatypes means that the programming overhead of passing return values (via pointer arguments and structs) is fairly high.

As such, abusing the language is normal and expected for OCaml beginners, especially those coming from the imperative domain, so don't fret if you find yourself rewriting functions so that they're shorter. It is happening because you are learning.


Tools for writing clean functions

You can't truly fix a problem until you know why it happens, yes? If it's true that our predisposition for the detail-oriented is what keeps us from writing clean functions, then perhaps the fix is one of the oldest in the book: Think before you hack. If you're not scoffing, then here is one idea for abetting this. If you're not exactly sure what you're doing, write a comment, just one phrase, describing what you want to do before you start writing your code. If you need another phrase, then you need another function.

I've collected a number of more concrete thoughts that helped me when I was learning OCaml, and I hope you find them useful as well.
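
Here is one way the one-phrase-per-function habit might look in practice. The task (building a title out of a word list) and every name in it are invented for illustration; each comment was written first, and each one fits in a single phrase.

```ocaml
(* Keep only the words longer than three characters. *)
let long_words words = List.filter (fun w -> String.length w > 3) words

(* Capitalize each word. *)
let capitalize_all words = List.map String.capitalize_ascii words

(* Join the surviving words into a title. *)
let title_of words = String.concat " " (capitalize_all (long_words words))
```

If the comment for `title_of` had needed an "and then", that would have been the hint to split it further.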


I have a hunch that clean code is not unlike a fractal...


OCaml-specific rules

- The broad 25-line rule

I pulled 25 out of my ass actually, but that's not really the point. The point of this exercise is to give beginners an idea of what OCaml code should look like. Readable OCaml code is typically composed of 5-20 line functions, or larger functions that are composed of subfunctions that are 3-13 lines long (assuming of course that you have spaces in between and not all 13 lines are actually code). Dispatches will be longer, but should not contain any logic of their own.

There will be many exceptions to this rule, particularly with purely imperative code like numerical computations, GUI dispatchers and sysadmin-like tasks. OCaml has attracted a lot of numerical folks to its side and those jerks often like to remind me that none of my rules for keeping my code clean apply to them. It's important to identify purely imperative tasks like numerics, because they typically result in long functions no matter what you do. Once you have them identified, you can confine them into their own functions so that they won't muddle the rest of your code.
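
A small sketch of what confining the imperative part can look like. The dot product is just a stand-in I picked for a numerical routine, and the names are hypothetical; the loop and the mutable accumulator stay inside one function, and everything around it remains ordinary functional code.

```ocaml
(* Imperative island: a plain loop over arrays, kept in one function. *)
let dot_product a b =
  let n = Array.length a in
  if Array.length b <> n then invalid_arg "dot_product: length mismatch";
  let acc = ref 0.0 in
  for i = 0 to n - 1 do
    acc := !acc +. a.(i) *. b.(i)
  done;
  !acc

(* The rest of the code stays functional and simply calls into it. *)
let norms vectors = List.map (fun v -> sqrt (dot_product v v)) vectors
```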


- Follow the types that you declare as layers of separation

The same H-M variant types that grant OCaml its incredible expressiveness are also useful guidelines for determining how you should write your functions. Fortunately for us, it's much easier to think in terms of what you want to hold in a type than it is to figure out where a function should begin and where it should end. In a nutshell, your functions may directly match against a particular type, but they shouldn't descend into the members of those types.



Good:

type entity =
  metadata *
  geometric_primitive

...

let render_geo_prim gp =
  match gp with
    TRIANGLE (a,b,c) -> ...

let render_entity (md,gp) =
  render_geo_prim gp


Bad:

type entity =
  metadata *
  geometric_primitive

...

let render_entity (md,gp) =
  match gp with
    TRIANGLE (a,b,c) -> ...
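
To make the "Good" layout concrete, here is a compilable sketch. The definitions of `point`, `metadata`, and `geometric_primitive`, the `LINE` constructor, and the `describe_*` functions are all my own assumptions, since the snippets above leave them out; returning a string stands in for real rendering.

```ocaml
(* Hypothetical definitions; the snippets above elide the real ones. *)
type point = float * float
type metadata = { label : string }
type geometric_primitive =
  | TRIANGLE of point * point * point
  | LINE of point * point
type entity = metadata * geometric_primitive

(* Matches on geometric_primitive only. *)
let describe_geo_prim gp =
  match gp with
  | TRIANGLE _ -> "triangle"
  | LINE _ -> "line"

(* Matches on entity (a tuple) only, and delegates the rest. *)
let describe_entity ((md, gp) : entity) =
  md.label ^ ": " ^ describe_geo_prim gp

(* Because describe_geo_prim stands alone, it also works on bare
   primitives that never had any metadata attached. *)
let describe_all prims = List.map describe_geo_prim prims
```

In the "Bad" layout, `describe_all` would have no function to call, because the match on the primitive would be trapped inside the entity function.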



I've tried applying this methodology at the module level. In a general sense, one could try to avoid descending into types that aren't declared in that module. I'm not sure if this works. Consider the following example. You have a graphics program. In this graphics program you have a type definition for the entities you want to draw, e.g. triangles, spheres, lines, meshes, etc. For the renderer you want the program to support a GL mode and a realtime-raytraced mode. Putting all of the code into the same file with the type definition would be a gross violation of layers, and will likely result in a single source file that'll be well over 100kb in size.

There are some types that you will want to expose the internals of. I term these types interface-types. Then there are some types that you expect to grow in the future, in ways that will totally mess with the pattern matching when you change the type spec. You typically want to hide the representations of the latter types, or perhaps consider using records instead of tuples. You can do this by writing a custom .mli header file for your source file, and then erasing the right hand side of the type definition. An easy way to do this is to pipe the output of `ocamlc -i`:

ocamlc -i blah.ml > blah.mli


blah.ml:
type stuff = wakka * float * int * string


blah.mli:
type stuff (* hide actual definition *)
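
The same hiding can be tried out in a single file by attaching a signature to a module, which is a handy way to experiment before splitting things into blah.ml and blah.mli. This is only a sketch: the `make` and `describe` functions and the record fields are hypothetical, and I've swapped the tuple for a record along the lines suggested above.

```ocaml
(* Hypothetical module; in a real project the signature would live in
   blah.mli and the structure in blah.ml. *)
module Blah : sig
  type stuff                      (* abstract: representation hidden *)
  val make : string -> float -> stuff
  val describe : stuff -> string
end = struct
  (* A record instead of a tuple, so adding a field later does not
     break every pattern match on this type. *)
  type stuff = { name : string; weight : float }
  let make name weight = { name; weight }
  let describe s = Printf.sprintf "%s (%.1f)" s.name s.weight
end

let () = print_endline (Blah.describe (Blah.make "crate" 2.5))
```

Code outside `Blah` can no longer pattern match on `stuff`, so changing its representation only touches the functions inside the module.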






Stuff that applies to OCaml and programming in general

- Resist any impulse to inline your code

I've heard many experienced and smart programmers tell me that I should inline a function because it's only got one caller. DO NOT DO THIS. Unless you are writing numerical computations or GL rendering routines, you *will* want the shorter function to List.map or Something.iter at some later point.
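
A hedged sketch of how this tends to play out. The settings-parsing scenario and all the names are invented: `parse_pair` starts life with exactly one caller, and the second caller shows up later and gets it for free.

```ocaml
(* Hypothetical helper: split a key=value line into a pair. *)
let parse_pair line =
  match String.index_opt line '=' with
  | None -> None
  | Some i ->
      let key = String.sub line 0 i in
      let value = String.sub line (i + 1) (String.length line - i - 1) in
      Some (key, value)

(* First (and for a while, only) caller: a single setting. *)
let parse_setting line = parse_pair (String.trim line)

(* Later caller: a whole file of settings, skipping malformed lines. *)
let parse_settings lines = List.filter_map parse_setting lines
```

`List.filter_map` needs OCaml 4.08 or newer; on older compilers the same thing can be written with `List.fold_right`.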

The jury is undecided on this in the realm of general programming. Although there are many good programmers who've told me that it's silly to split functions up, the best C/C++ programmers I know will typically do so. However, I care little for arguing how things are done with ancient fossils like C, so I leave the general applicability of this rule up to the reader.


- No more than 4 arguments for any given function

Why four? Cognitive psychologists have known for some time that people have a short term memory buffer of size 7, give or take 2. I'm sure many programmers will have the "max" of 9 (more if they've got Asperger's or autism which is perfectly possible), but you don't want your code to be only readable to the freaks (like yourselves) with photographic memory. Pushing the limits of your coworkers' memory isn't nice. Kindly leave them a free register.

"But I can't do with less than 5 independent arguments...", you say! If you *need* more than 4 independent, unrelated arguments for a function, it's possible that you need to write whatever it is you are doing in OO. If you need more than 4 arguments and they are not independent/unrelated, then perhaps you should be passing those values around as a tuple or as a record. I've found it helpful to try to understand why the OCaml designers implemented records, when there are already tuples in the language. Functors are also sometimes useful for making code readable, and for keeping the argument count down.

It's not a bad idea to apply this analysis elsewhere. In OCaml, 1 line of code will often express 1 routine, as opposed to C, which often requires 10-20 lines to do something like traverse a list of any sort (even using library functions). Thus it's practical to have many functions that are only 3-7 lines long in OCaml.

Note that you'll have to amend this rule for older languages like C, where we commonly pass around 2-3 arguments that we barely need, because the language does not support tuples or OOP.
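
For the case where the arguments are related, here is a minimal sketch of bundling them into records. The rectangle-drawing scenario, the `rect` and `style` types, and the function names are all hypothetical; a formatted string stands in for actual drawing.

```ocaml
(* Before (hypothetical): six positional floats, easy to mix up.
   let draw_rect x y w h border_width opacity = ... *)

(* After: two records, so the function takes two arguments. *)
type rect = { x : float; y : float; width : float; height : float }
type style = { border_width : float; opacity : float }

(* Small helpers now take one self-describing argument. *)
let area r = r.width *. r.height

let describe_rect rect style =
  Printf.sprintf "%.0fx%.0f at (%.0f, %.0f), border %.0f, opacity %.1f"
    rect.width rect.height rect.x rect.y style.border_width style.opacity
```

Unlike a tuple, the record names each field at the call site, which is one plausible answer to why the language has both.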


- No getters for the OO code

People with C++ and Java experience will recoil in horror at the idea that you can't make a class variable public, or even protected.

Having written OO code in several languages for some time now, I'm convinced that C++ and Java need to expose their class implementations and member variables because they lack functional expressiveness (e.g. tuples, abstract primitive datatypes, first class functions). In fact, one of the greatest epiphanies I've had came after I read that you generally should not have any getters at all. Try to do everything locally, without exposing any variables.

Ironically, I read the no-getters rule in a Python article somewhere, which itself sourced the tactic to an unnamed Java article! The idea applies to OCaml all the same. Not using any getters lives up to the spirit of what the OCaml OO restrictions are trying to accomplish: proper separation of layers in your OO code.

There will be corner-cases where the no-getters rule doesn't apply, but it's surprising how far you can get with this practice, and how much it improves your code. It also seems to be possible to shove all the variables that need getters into a separate class, usually called something like "SettingsForBlah".
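
To show what doing everything locally can look like, here is a small sketch of a class with no getters. The `account` class and its methods are invented for this example; the idea is that callers ask the object a question or ask it to act, rather than fetching `balance` and doing the arithmetic themselves.

```ocaml
(* Hypothetical account class with no getters. *)
class account (initial : int) =
  object
    (* Instance variables cannot be read by outside callers. *)
    val mutable balance = initial

    (* Instead of get_balance plus external arithmetic, the object
       answers the question the caller actually has. *)
    method can_afford price = balance >= price

    (* The check and the update live together, next to the data. *)
    method withdraw amount =
      if amount > balance then Error "insufficient funds"
      else begin
        balance <- balance - amount;
        Ok ()
      end
  end
```

A caller writes `a#can_afford 60` or `a#withdraw 60` and never sees the number itself.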

Monday, June 30, 2008

Études en OCaml: Keeping your functions clean, Pt 1

"But you could save a PUSH instruction if you inlined it..."

One of my coworkers, who is still somewhat fresh out of college, asked me why I had broken up one of my functions into a few smaller ones, when there was only one caller for each of them. One of the funnier things he suggested was that inlining the code would save an x86 push instruction. But then he surprised me, by suggesting that unnecessarily breaking up functions can make code more difficult to read.


Why is it so hard to keep functions short?

It's hard to keep functions short. If you do not think so, then you are probably lucky, or smart, or perhaps it isn't too difficult for you after years of hacking code, but that doesn't make it any less inherently difficult. It is still useful to understand why other people find it difficult to keep functions short. Indeed, why is it difficult at all?

Before we continue, I should point out that it's better to say that functions should be "atomic", or "clean". A good function is one that can't logically be broken down any further without making it pointless. This small but important semantic difference has significance later on.

What do you see here?

I'm going to make a grand, sweeping assertion that many of you will likely say "duh" to. If you disagree, then it's probably because it doesn't apply to you, and that's great, but we're trying to figure out why keeping functions clean and short is difficult for most programmers, myself included, so please bear with me.

Programmers are by and large detail-oriented. At the least, we're good at details, because we have to be. Computers are finicky contraptions that break unless everything is exactly correct. There is no "almost works" in much of what we do. Because of their digital and binary nature, computers and the tools we use to program them are not fault tolerant. It is left to the programmer to make sure everything is right, and because of this, anyone who does not enjoy picking apart the details of a problem probably won't enjoy programming.

So how did I come to this conclusion? It's quite simple. Back in school, we had two pretty darn tough "weed-out" courses. One course was taught in a high level language, and the other course was taught in C and assembly. Sounds like *your* university doesn't it? Now let me ask you real quick, did the majority of your CS friends find the C/Assembly class easier, or did they totally pwn that Scheme/Lisp/SML/Haskell class? Which one did they like more? Most importantly, which one was regarded as "more practical"?

Our undergraduate advisors noted that people generally found the C/Assembly class easier. Most CS students also liked that class more. The ones who didn't usually went on to grad school, which serves to further skew the pool of industry programmers in the direction of the detail-oriented.

Programmers in industry who maintain others' code are subjected to an additional set of biases. It takes some time to learn to read code, and reading code is a bottom-up process, especially if the said code is written poorly (now isn't it funny the way that *everyone* works somewhere where the code sucks?). If you're new to the field (like my coworker) and you're still plodding through the code line by line, it wouldn't seem that there'd be any point to breaking functions up at all. What's more, people sometimes break up functions without adding any atomicity to the structure of their code.

Since programmers are detail-oriented people, writing atomic functions is not something that comes naturally because we do not typically encounter our problems from the top down. This is especially true if we learned to program in a low-level, imperative language like C. Writing clean, atomic functions requires you to understand consciously what it is you are trying to do, at a high level.

A tree or a forest?

But many programmers don't consciously know what they're trying to do. They just hack towards a solution and somehow, magically, a working piece of code appears, both to their benefit and to their detriment.

The awesome and frightening conclusion that we gather from this is that it is hard for many of us to write clean code simply because we *can* hack code. Our ability to meticulously design algorithms, to read and hack other people's code (often on a line-by-line basis), to find needles in haystacks (debugging, anyone?), and perhaps our own personalities: these things are what made us into programmers in the first place. Details matter: above all, the code must work! "We can worry about how messy it is after we ship. ;)"

But these traits predispose us to a mindset where we do not see our problems in the big picture sense that is useful for writing clean/atomic functions, writing and using black boxes, and maintaining a separation of layers.

Indeed, the fact that we're told that we should write "short" functions is indicative of this mindset. Shortness has little semantic meaning. It is a detail, a direct metric, not an overarching goal like "cleanliness".