Closures, and when they’re useful.
Friday, May 20th, 2011
When is a closure useful?
Before we start with why a closure is useful, we might first need to understand what exactly a closure is.
First-class functions
In order to understand what a closure is, we must realize that in many, if not most, languages we can not only call functions, but also pass references to a function around in a variable. If a language supports that, it is said to have first-class functions. This can be used, amongst other things, to implement callbacks: you pass a reference to a function to a part of the program, which can then later call the function and obtain the results.
A common example of something that uses callback functions is a sorting routine that takes a comparison function. A function that takes another function as an argument (or returns one) is called a higher-order function. For instance, Python’s sorted function:
sorted(iterable, cmp=None, key=None, reverse=False) --> new sorted list
The cmp parameter is a callback function. If we have a list of custom objects:
    class MyPerson():
        def __init__(self, name, age):
            self.name = name
            self.age = age

    people = [
        MyPerson('john', 24),
        MyPerson('santa', 100),
        MyPerson('pete', 30),
    ]

and we want to sort people by age, we can do so by defining our own custom comparison function and passing it to sorted:

    def my_cmp(a, b):
        # Negative, zero or positive, depending on how the two ages compare.
        return cmp(a.age, b.age)

    sorted(people, my_cmp)

The sorted function will now go through the items in people and call the callback function my_cmp for two items at a time. Depending on whether the result says the first item is smaller or bigger than the second, it decides which one comes first, and it returns a new sorted list (people itself is left untouched). Note that we are not calling my_cmp! We’re simply passing a reference to the function to sorted.
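The example above targets Python 2, where sorted accepts a cmp argument. As a rough sketch of the same callback idea in Python 3, where that argument no longer exists, the comparison function can be wrapped with functools.cmp_to_key. The names Person and by_age are made up for this illustration:

```python
from functools import cmp_to_key


class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age


def by_age(a, b):
    # Negative, zero or positive, like Python 2's cmp() built-in.
    return (a.age > b.age) - (a.age < b.age)


people = [Person('john', 24), Person('santa', 100), Person('pete', 30)]

# by_age is passed as a reference; sorted() calls it whenever it needs
# to compare two items.
sorted_people = sorted(people, key=cmp_to_key(by_age))
names = [p.name for p in sorted_people]
print(names)  # ['john', 'pete', 'santa']
```

The original list keeps its order; only the returned list is sorted.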
Nested functions
Okay, so that covers first-class functions. Many languages also support nested functions. Example:
    def get_cmp_func(key='age'):
        def my_cmp_name(a, b):
            return cmp(a.name, b.name)

        def my_cmp_age(a, b):
            return cmp(a.age, b.age)

        if key == 'name':
            return my_cmp_name
        elif key == 'age':
            return my_cmp_age

The get_cmp_func function returns a function that can be used to compare things depending on what you pass as the key parameter. get_cmp_func is also a higher-order function because it returns a reference to a function. Of course in this use-case there are better ways of sorting the list, but it’s just an example.
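To see the returned functions in action, here is a small sketch in Python 3. Several things in it are assumptions for the sake of a runnable example: the compare helper stands in for Python 2's cmp, Person is a namedtuple rather than the class used earlier, pete's age is changed so the two orderings differ, and the ValueError for unknown keys is an addition.

```python
from collections import namedtuple
from functools import cmp_to_key

Person = namedtuple('Person', ['name', 'age'])


def compare(x, y):
    # Stand-in for Python 2's cmp() built-in.
    return (x > y) - (x < y)


def get_cmp_func(key='age'):
    def my_cmp_name(a, b):
        return compare(a.name, b.name)

    def my_cmp_age(a, b):
        return compare(a.age, b.age)

    if key == 'name':
        return my_cmp_name
    elif key == 'age':
        return my_cmp_age
    # Addition for this sketch: fail loudly instead of returning None.
    raise ValueError('unknown key: %r' % (key,))


people = [Person('john', 24), Person('santa', 100), Person('pete', 19)]

by_name = [p.name for p in sorted(people, key=cmp_to_key(get_cmp_func('name')))]
by_age = [p.name for p in sorted(people, key=cmp_to_key(get_cmp_func('age')))]
print(by_name)  # ['john', 'pete', 'santa']
print(by_age)   # ['pete', 'john', 'santa']
```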
Anonymous functions
Anonymous functions are not a requirement for closures, but it may be a good idea to explain what they are nonetheless, as there’s a lot of confusion over when exactly something is an anonymous function.
Anonymous functions, sometimes also called lambdas, are simply that: anonymous. They have no name. Looking at previous examples in this post, we see function names such as my_cmp, get_cmp_func and even nested functions with names: my_cmp_age. Anonymous functions have no name. That doesn’t mean they can’t be passed around as a reference though! Example:
sorted(people, lambda a, b: cmp(a.age, b.age))
The anonymous function here is: lambda a, b: cmp(a.age, b.age). As you can see, it looks a lot like our first my_cmp function, except it has no name and doesn’t seem to return anything. That’s because an anonymous (lambda) function in Python always implicitly returns the value of its body. In fact, the body of a lambda in Python can only be a single expression, not a series of statements. (Other languages allow for more advanced anonymous functions; Python likes to keep it simple).
Okay, so why exactly would you need anonymous functions? Well, if your language already supports first-class functions (passing around references to a function), there really isn’t a need for anonymous functions, except that it saves some typing. Lambda functions are syntactic sugar for defining a function without having to name it.
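As a small illustration of the "syntactic sugar" point, the sketch below defines the same function twice, once with def and once with lambda. The names add_named and add_lambda are made up for this example, and binding a lambda to a name is done here only to compare the two forms side by side.

```python
def add_named(a, b):
    return a + b


# Same behaviour, but the function object itself carries no real name.
add_lambda = lambda a, b: a + b

print(add_named(2, 3), add_lambda(2, 3))  # 5 5
print(add_named.__name__)                 # add_named
print(add_lambda.__name__)                # <lambda>
```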
Scope
So, a closure: what is it? Again, before we can understand closures, we need to understand scope. Scope determines which variables and functions we can access at a certain location in our code. When a function is called, the programming language allocates a piece of memory where the function’s parameters and local variables are stored. This piece of memory (typically a frame on the call stack) is automatically cleared when the function returns. The names that live there make up the function’s local scope.
Functions can usually also reference variables of the parent scope. For example:

    a = 10

    def print_a():
        print a

    print_a()  # output: 10
The print_a function has access to the a variable in the parent scope. But if we define a in a function’s local scope, we’ll get an error:
    def define_a():
        a = 10

    def print_a():
        print a

    define_a()
    print_a()  # NameError: global name 'a' is not defined

We get a NameError when we try to print a’s value, because it is defined in define_a’s local scope, which is destroyed as soon as define_a stops running. This is called going out of scope. Anything a piece of code can access (local scope, parent scope) is said to be within scope.
Closures
Now, finally, closures!
A closure is a special way in which scopes are handled. Instead of a function going out of scope and all the variables and functions in its scope (the local scope, the parent scope, the grandparent scope, and so on) being destroyed, the scope is kept around for later use. Let’s look at an example:

    def define_a():
        a = 10

        def print_a():
            print a

        return print_a

    var_print_a = define_a()
    var_print_a()  # output: 10
This outputs 10. Let’s take a look at what’s happening. We define a function define_a and set a = 10 in its local scope. We then define a nested function that prints a from the parent scope. The define_a function then returns a reference to that function.
Next, we call define_a, which returns a reference to print_a, and we assign that reference to the variable var_print_a. Then we call var_print_a as a function, that is, we call print_a through the reference stored in the variable. By all accounts it shouldn’t work, because define_a has already stopped running. It has gone out of scope and its scope (containing a) should have been destroyed. But it’s not, because Python kept its scope around. This is a closure. The variables that were in scope at the time the closure was generated are still accessible to the function, and are now known as free variables.
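Python lets you peek at what a closure has kept around. The sketch below repeats the example in Python 3 syntax, with print_a returning the value instead of printing it so the result can be inspected; the `__code__.co_freevars` and `__closure__` attributes are part of the standard function object.

```python
def define_a():
    a = 10

    def print_a():
        return a  # 'a' is a free variable of print_a

    return print_a


var_print_a = define_a()

print(var_print_a())                             # 10
# Names of the free variables the function refers to.
print(var_print_a.__code__.co_freevars)          # ('a',)
# The kept-around value lives in a "cell" attached to the function.
print(var_print_a.__closure__[0].cell_contents)  # 10
```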
The use-case
So, when are closures useful? Why not just use an Object and store the value in the object, along with a method that uses the object?
Let’s say we have a multithreaded program that handles requests. Data is stored in a database. The request handlers need to access the data in the database, but each thread has to have its own connection to the database, or they might accidentally overwrite each other’s data. So our multithreaded program allows us to register a callback function which will be called when a new thread starts. The callback function should return a new database connection for use in the thread.

    def make_db_connection():
        return db.conn(host='localhost', username='john', passwd='f00b4r')

    app = MyMultiThreadedApp(on_new_thread_cb=make_db_connection)
    app.serve()
MyMultiThreadedApp will call make_db_connection for each new thread it starts, and the thread can then use the database connection returned by make_db_connection. But there is a problem! The database connection information (host, username, passwd) is hard-coded, but we want to get it from a configuration file instead!
So? We just pass some parameters to make_db_connection, right? Wrong!

    def make_db_connection(host, username, passwd):
        return db.conn(host=host, username=username, passwd=passwd)

    app = MyMultiThreadedApp(on_new_thread_cb=make_db_connection)
    app.serve()

This example won’t work! Why not? Because MyMultiThreadedApp has absolutely no idea it should pass parameters to make_db_connection. Remember that we’re not calling the function ourselves, we’re just passing a reference to MyMultiThreadedApp, which will call it eventually. There’s no way for it to know which parameters it should pass, because that depends on how your database needs to be set up. SQLite only needs a path parameter, but MySQL also needs username, password, and a host.
This is where closures step in:
    def gen_db_connector(host, username, passwd):
        def make_db_connection():
            # host, username and passwd come from the enclosing scope.
            return db.conn(host=host, username=username, passwd=passwd)

        return make_db_connection

    callback_func = gen_db_connector('localhost', 'john', 'f00b4r')
    app = MyMultiThreadedApp(on_new_thread_cb=callback_func)
    app.serve()
The gen_db_connector function generates a closure (make_db_connection) which has access to host, username and passwd. We then get a reference to the closure, put it in callback_func and pass that to MyMultiThreadedApp. Now when a new thread is created, and the callback function is called, it will have access to the host, username and passwd information, without MyMultiThreadedApp needing to know which params it should pass on.
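Since db and MyMultiThreadedApp are not real libraries, here is a self-contained sketch of the same pattern that can actually be run. TinyThreadedApp is a hypothetical stand-in for MyMultiThreadedApp, and the "connection" is just a dictionary; the only point is that the app calls the callback without arguments, yet each thread still sees the captured settings.

```python
import threading


def gen_db_connector(host, username, passwd):
    def make_db_connection():
        # Stand-in for a real connection: a dict holding the captured settings.
        return {
            'host': host,
            'username': username,
            'passwd': passwd,
            'thread': threading.current_thread().name,
        }

    return make_db_connection


class TinyThreadedApp:
    """Hypothetical stand-in for MyMultiThreadedApp."""

    def __init__(self, on_new_thread_cb):
        self.on_new_thread_cb = on_new_thread_cb
        self.connections = []
        self._lock = threading.Lock()

    def _worker(self):
        # The app knows nothing about hosts or passwords; it just calls
        # the callback with no arguments.
        conn = self.on_new_thread_cb()
        with self._lock:
            self.connections.append(conn)

    def serve(self, n_threads=3):
        threads = [
            threading.Thread(target=self._worker, name='worker-%d' % i)
            for i in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


callback_func = gen_db_connector('localhost', 'john', 'f00b4r')
app = TinyThreadedApp(on_new_thread_cb=callback_func)
app.serve()

thread_names = sorted(c['thread'] for c in app.connections)
print(thread_names)  # ['worker-0', 'worker-1', 'worker-2']
```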
An alternative to closures
There’s a different way of accomplishing this though, by using objects:

    class DBConnector():
        def __init__(self, host, username, passwd):
            self.host = host
            self.username = username
            self.passwd = passwd

        def connect(self):
            return db.conn(
                host=self.host,
                username=self.username,
                passwd=self.passwd
            )

    db_conn = DBConnector('localhost', 'john', 'f00b4r')
    app = MyMultiThreadedApp(on_new_thread_cb=db_conn.connect)
    app.serve()

However, this takes a lot more lines, and whether it works depends on whether your programming language allows first-class methods. That is, passing around references to methods on an object, while still being able to call them as instance methods (instead of just as static methods).
I’d personally argue for the object way. Closures are a concept which is very hard to understand for less experienced programmers. It is a matter of debate whether closures hide state in an unpredictable way. I tend to think they do, and I’m not much of a fan of free variables since it is hard to guess where they came from. At any rate, objects are easier to understand than closures, so if at all possible, go for the object way.
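Python does support this: a method looked up on an instance is a bound method that remembers the object it came from. A minimal sketch, with a made-up Greeter class:

```python
class Greeter:
    def __init__(self, greeting):
        self.greeting = greeting

    def greet(self):
        return self.greeting


g = Greeter('hello')

# No parentheses: we take a reference to the method, we don't call it.
callback = g.greet

print(callback())               # hello
print(callback.__self__ is g)   # True: the bound method carries its instance
```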