KEMBAR78
Programming in Lua 4th Edition-Trang-2 | PDF | Scope (Computer Science) | Variable (Computer Science)
0% found this document useful (0 votes)
16 views70 pages

Programming in Lua 4th Edition-Trang-2

Chapter 8 of the document discusses Lua's local variables and control structures, emphasizing the importance of local variables for scope management and performance. It covers various control structures such as if statements, while loops, repeat-until loops, and for loops, explaining their syntax and usage. Additionally, it introduces the break, return, and goto statements, highlighting their roles in controlling program flow and providing examples of state machines.

Uploaded by

ca can
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views70 pages

Programming in Lua 4th Edition-Trang-2

Chapter 8 of the document discusses Lua's local variables and control structures, emphasizing the importance of local variables for scope management and performance. It covers various control structures such as if statements, while loops, repeat-until loops, and for loops, explaining their syntax and usage. Additionally, it introduces the break, return, and goto statements, highlighting their roles in controlling program flow and providing examples of state machines.

Uploaded by

ca can
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 70

Chapter 8.

Filling some Gaps


We have already used most of Lua's syntactical constructions in previous examples, but it is easy to miss
some details. For completeness, this chapter closes the first part of the book with more details about them.

Local Variables and Blocks


By default, variables in Lua are global. All local variables must be declared as such. Unlike global vari-
ables, a local variable has its scope limited to the block where it is declared. A block is the body of a control
structure, the body of a function, or a chunk (the file or string where the variable is declared):

x = 10
local i = 1 -- local to the chunk

while i <= x do
local x = i * 2 -- local to the while body
print(x) --> 2, 4, 6, 8, ...
i = i + 1
end

if i > 20 then
local x -- local to the "then" body
x = 20
print(x + 2) -- (would print 22 if test succeeded)
else
print(x) --> 10 (the global one)
end

print(x) --> 10 (the global one)

Beware that this last example will not work as expected if you enter it in interactive mode. In interactive
mode, each line is a chunk by itself (unless it is not a complete command). As soon as you enter the
second line of the example (local i = 1), Lua runs it and starts a new chunk in the next line. By
then, the local declaration is already out of scope. To solve this problem, we can delimit the whole block
explicitly, bracketing it with the keywords do–end. Once you enter the do, the command completes only
at the corresponding end, so Lua will not execute each line by itself.

These do blocks are useful also when we need finer control over the scope of some local variables:

local x1, x2
do
local a2 = 2*a
local d = (b^2 - 4*a*c)^(1/2)
x1 = (-b + d)/a2
x2 = (-b - d)/a2
end -- scope of 'a2' and 'd' ends here
print(x1, x2) -- 'x1' and 'x2' still in scope

It is good programming style to use local variables whenever possible. Local variables avoid cluttering
the global environment with unnecessary names; they also avoid name clashes between different parts of
a program. Moreover, the access to local variables is faster than to global ones. Finally, a local variable
vanishes as soon as its scope ends, allowing the garbage collector to release its value.

57
Filling some Gaps

Given that local variables are “better” than global ones, some people argue that Lua should use local
by default. However, local by default has its own set of problems (e.g., issues with accessing non-local
variables). A better approach would be no default, that is, all variables should be declared before used.
The Lua distribution comes with a module strict.lua for global-variable checks; it raises an error if
we try to assign to a non-existent global inside a function or to use a non-existent global. It is a good habit
to use it when developing Lua code.

Each local declaration can include an initial assignment, which works the same way as a conventional
multiple assignment: extra values are thrown away, extra variables get nil. If a declaration has no initial
assignment, it initializes all its variables with nil:

local a, b = 1, 10
if a < b then
print(a) --> 1
local a -- '= nil' is implicit
print(a) --> nil
end -- ends the block started at 'then'
print(a, b) --> 1 10

A common idiom in Lua is

local foo = foo

This code creates a local variable, foo, and initializes it with the value of the global variable foo. (The
local foo becomes visible only after its declaration.) This idiom is useful to speed up the access to foo. It
is also useful when the chunk needs to preserve the original value of foo even if later some other function
changes the value of the global foo; in particular, it makes the code resistant to monkey patching. Any
piece of code preceded by local print = print will use the original function print even if
print is monkey patched to something else.

Some people think it is a bad practice to use declarations in the middle of a block. Quite the opposite:
by declaring a variable only when we need it, we seldom need to declare it without an initial value (and
therefore we seldom forget to initialize it). Moreover, we shorten the scope of the variable, which increases
readability.

Control Structures
Lua provides a small and conventional set of control structures, with if for conditional execution and
while, repeat, and for for iteration. All control structures have a syntax with an explicit terminator: end
terminates if, for and while structures; until terminates repeat structures.

The condition expression of a control structure can result in any value. Remember that Lua treats as true
all values different from false and nil. (In particular, Lua treats both zero and the empty string as true.)

if then else
An if statement tests its condition and executes its then-part or its else-part accordingly. The else-part
is optional.

if a < 0 then a = 0 end

if a < b then return a else return b end

if line > MAXLINES then

58
Filling some Gaps

showpage()
line = 0
end

To write nested ifs we can use elseif. It is similar to an else followed by an if, but it avoids the need for
multiple ends:

if op == "+" then
r = a + b
elseif op == "-" then
r = a - b
elseif op == "*" then
r = a*b
elseif op == "/" then
r = a/b
else
error("invalid operation")
end

Because Lua has no switch statement, such chains are somewhat common.

while
As the name implies, a while loop repeats its body while a condition is true. As usual, Lua first tests the
while condition; if the condition is false, then the loop ends; otherwise, Lua executes the body of the loop
and repeats the process.

local i = 1
while a[i] do
print(a[i])
i = i + 1
end

repeat
As the name implies, a repeat–until statement repeats its body until its condition is true. This statement
does the test after the body, so that it always executes the body at least once.

-- print the first non-empty input line


local line
repeat
line = io.read()
until line ~= ""
print(line)

Differently from most other languages, in Lua the scope of a local variable declared inside the loop includes
the condition:

-- computes the square root of 'x' using Newton-Raphson method


local sqr = x / 2
repeat
sqr = (sqr + x/sqr) / 2
local error = math.abs(sqr^2 - x)
until error < x/10000 -- local 'error' still visible here

59
Filling some Gaps

Numerical for
The for statement has two variants: the numerical for and the generic for.

A numerical for has the following syntax:

for var = exp1, exp2, exp3 do


something
end

This loop will execute something for each value of var from exp1 to exp2, using exp3 as the step
to increment var. This third expression is optional; when absent, Lua assumes one as the step value. If
we want a loop without an upper limit, we can use the constant math.huge:

for i = 1, math.huge do
if (0.3*i^3 - 20*i^2 - 500 >= 0) then
print(i)
break
end
end

The for loop has some subtleties that you should learn in order to make good use of it. First, all three
expressions are evaluated once, before the loop starts. Second, the control variable is a local variable
automatically declared by the for statement, and it is visible only inside the loop. A typical mistake is to
assume that the variable still exists after the loop ends:

for i = 1, 10 do print(i) end


max = i -- probably wrong!

If you need the value of the control variable after the loop (usually when you break the loop), you must
save its value into another variable:

-- find a value in a list


local found = nil
for i = 1, #a do
if a[i] < 0 then
found = i -- save value of 'i'
break
end
end
print(found)

Third, you should not change the value of the control variable: the effect of such changes is unpredictable.
If you want to end a for loop before its normal termination, use break (as we did in the previous example).

Generic for
The generic for loop traverses all values returned by an iterator function. We saw some examples already,
with pairs, ipairs, io.lines, etc. Despite its apparent simplicity, the generic for is powerful. With
proper iterators, we can traverse almost anything in a readable fashion.

Of course, we can write our own iterators. Although the use of the generic for is easy, the task of writing
iterator functions has its subtleties; hence, we will cover this topic later, in Chapter 18, Iterators and the
Generic for.

60
Filling some Gaps

Unlike the numerical for, the generic for can have multiple variables, which are all updated at each itera-
tion. The loop stops when the first variable gets nil. As in the numerical loop, the loop variables are local
to the loop body and you should not change their values inside each iteration.

break, return, and goto


The break and return statements allow us to jump out of a block. The goto statement allows us to jump
to almost any point in a function.

We use the break statement to finish a loop. This statement breaks the inner loop (for, repeat, or while)
that contains it; it cannot be used outside a loop. After the break, the program continues running from the
point immediately after the broken loop.

A return statement returns the results from a function or simply finishes the function. There is an implicit
return at the end of any function, so we do not need to write one for functions that end naturally, without
returning any value.

For syntactic reasons, a return can appear only as the last statement of a block: in other words, as the
last statement in our chunk or just before an end, an else, or an until. For instance, in the next example,
return is the last statement of the then block:

local i = 1
while a[i] do
if a[i] == v then return i end
i = i + 1
end

Usually, these are the places where we use a return, because any statement following it would be un-
reachable. Sometimes, however, it may be useful to write a return in the middle of a block; for instance,
we may be debugging a function and want to avoid its execution. In such cases, we can use an explicit
do block around the statement:

function foo ()
return --<< SYNTAX ERROR
-- 'return' is the last statement in the next block
do return end -- OK
other statements
end

A goto statement jumps the execution of a program to a corresponding label. There has been a long going
debate about goto, with some people arguing even today that they are harmful to programming and should
be banned from programming languages. Nonetheless, several current languages offer goto, with good
reason. They are a powerful mechanism and, when used with care, can only improve the quality of our
code.

In Lua, the syntax for a goto statement is quite conventional: it is the reserved word goto followed by the
label name, which can be any valid identifier. The syntax for a label is a little more convoluted: it has two
colons followed by the label name followed by more two colons, like in ::name::. This convolution is
intentional, to highlight labels in a program.

Lua poses some restrictions to where we can jump with a goto. First, labels follow the usual visibility
rules, so we cannot jump into a block (because a label inside a block is not visible outside it). Second, we
cannot jump out of a function. (Note that the first rule already excludes the possibility of jumping into a
function.) Third, we cannot jump into the scope of a local variable.

61
Filling some Gaps

A typical and well-behaved use of a goto is to simulate some construction that you learned from another
language but that is absent from Lua, such as continue, multi-level break, multi-level continue, redo, local
error handling, etc. A continue statement is simply a goto to a label at the end of a loop block; a redo
statement jumps to the beginning of the block:

while some_condition do
::redo::
if some_other_condition then goto continue
else if yet_another_condition then goto redo
end
some code
::continue::
end

A useful detail in the specification of Lua is that the scope of a local variable ends on the last non-void
statement of the block where the variable is defined; labels are considered void statements. To see the
usefulness of this detail, consider the next fragment:

while some_condition do
if some_other_condition then goto continue end
local var = something
some code
::continue::
end

You may think that this goto jumps into the scope of the variable var. However, the continue label
appears after the last non-void statement of the block, and therefore it is not inside the scope of var.

The goto is also useful for writing state machines. As an example, Figure 8.1, “An example of a state
machine with goto” shows a program that checks whether its input has an even number of zeros.

Figure 8.1. An example of a state machine with goto


::s1:: do
local c = io.read(1)
if c == '0' then goto s2
elseif c == nil then print'ok'; return
else goto s1
end
end

::s2:: do
local c = io.read(1)
if c == '0' then goto s1
elseif c == nil then print'not ok'; return
else goto s2
end
end

goto s1

There are better ways to write this specific program, but this technique is useful if we want to translate a
finite automaton into Lua code automatically (think about dynamic code generation).

As another example, let us consider a simple maze game. The maze has several rooms, each with up to
four doors: north, south, east, and west. At each step, the user enters a movement direction. If there is a

62
Filling some Gaps

door in this direction, the user goes to the corresponding room; otherwise, the program prints a warning.
The goal is to go from an initial room to a final room.

This game is a typical state machine, where the current room is the state. We can implement this maze
with one block for each room, using a goto to move from one room to another. Figure 8.2, “A maze game”
shows how we could write a small maze with four rooms.

Figure 8.2. A maze game


goto room1 -- initial room

::room1:: do
local move = io.read()
if move == "south" then goto room3
elseif move == "east" then goto room2
else
print("invalid move")
goto room1 -- stay in the same room
end
end

::room2:: do
local move = io.read()
if move == "south" then goto room4
elseif move == "west" then goto room1
else
print("invalid move")
goto room2
end
end

::room3:: do
local move = io.read()
if move == "north" then goto room1
elseif move == "east" then goto room4
else
print("invalid move")
goto room3
end
end

::room4:: do
print("Congratulations, you won!")
end

For this simple game, you may find that a data-driven program, where you describe the rooms and move-
ments with tables, is a better design. However, if the game has several special situations in each room,
then this state-machine design is quite appropriate.

Exercises
Exercise 8.1: Most languages with a C-like syntax do not offer an elseif construct. Why does Lua need
this construct more than those languages?

63
Filling some Gaps

Exercise 8.2: Describe four different ways to write an unconditional loop in Lua. Which one do you prefer?

Exercise 8.3: Many people argue that repeat--until is seldom used, and therefore it should not be present
in a minimalistic language like Lua. What do you think?

Exercise 8.4: As we saw in the section called “Proper Tail Calls”, a tail call is a goto in disguise. Using
this idea, reimplement the simple maze game from the section called “break, return, and goto” using tail
calls. Each block should become a new function, and each goto becomes a tail call.

Exercise 8.5: Can you explain why Lua has the restriction that a goto cannot jump out of a function? (Hint:
how would you implement that feature?)

Exercise 8.6: Assuming that a goto could jump out of a function, explain what the program in Figure 8.3,
“A strange (and invalid) use of a goto” would do.

Figure 8.3. A strange (and invalid) use of a goto


function getlabel ()
return function () goto L1 end
::L1::
return 0
end

function f (n)
if n == 0 then return getlabel()
else
local res = f(n - 1)
print(n)
return res
end
end

x = f(10)
x()

(Try to reason about the label using the same scoping rules used for local variables.)

64
Part II. Real Programming
Table of Contents
9. Closures ....................................................................................................................... 68
Functions as First-Class Values .................................................................................... 68
Non-Global Functions ................................................................................................ 69
Lexical Scoping ........................................................................................................ 71
A Taste of Functional Programming ............................................................................. 74
10. Pattern Matching .......................................................................................................... 77
The Pattern-Matching Functions ................................................................................... 77
The function string.find .............................................................................. 77
The function string.match ............................................................................ 77
The function string.gsub .............................................................................. 78
The function string.gmatch .......................................................................... 78
Patterns .................................................................................................................... 78
Captures ................................................................................................................... 82
Replacements ............................................................................................................ 83
URL encoding .................................................................................................. 84
Tab expansion ................................................................................................... 86
Tricks of the Trade .................................................................................................... 86
11. Interlude: Most Frequent Words ..................................................................................... 90
12. Date and Time ............................................................................................................. 92
The Function os.time ............................................................................................. 92
The Function os.date ............................................................................................. 93
Date–Time Manipulation ............................................................................................ 95
13. Bits and Bytes ............................................................................................................. 97
Bitwise Operators ...................................................................................................... 97
Unsigned Integers ...................................................................................................... 97
Packing and Unpacking Binary Data ............................................................................. 99
Binary files ............................................................................................................. 101
14. Data Structures ........................................................................................................... 104
Arrays .................................................................................................................... 104
Matrices and Multi-Dimensional Arrays ....................................................................... 105
Linked Lists ............................................................................................................ 107
Queues and Double-Ended Queues ............................................................................. 107
Reverse Tables ........................................................................................................ 108
Sets and Bags ......................................................................................................... 109
String Buffers ......................................................................................................... 110
Graphs ................................................................................................................... 111
15. Data Files and Serialization .......................................................................................... 114
Data Files ............................................................................................................... 114
Serialization ............................................................................................................ 116
Saving tables without cycles .............................................................................. 118
Saving tables with cycles .................................................................................. 119
16. Compilation, Execution, and Errors ............................................................................... 122
Compilation ............................................................................................................ 122
Precompiled Code .................................................................................................... 125
Errors ..................................................................................................................... 126
Error Handling and Exceptions .................................................................................. 127
Error Messages and Tracebacks .................................................................................. 128
17. Modules and Packages ................................................................................................ 131
The Function require ............................................................................................ 132
Renaming a module ......................................................................................... 133
Path searching ................................................................................................. 133

66
Real Programming

Searchers ........................................................................................................ 135


The Basic Approach for Writing Modules in Lua .......................................................... 135
Submodules and Packages ......................................................................................... 137

67
Chapter 9. Closures
Functions in Lua are first-class values with proper lexical scoping.

What does it mean for functions to be “first-class values”? It means that, in Lua, a function is a value with
the same rights as more conventional values like numbers and strings. A program can store functions in
variables (both global and local) and in tables, pass functions as arguments to other functions, and return
functions as results.

What does it mean for functions to have “lexical scoping”? It means that functions can access variables of
their enclosing functions. (It also means that Lua properly contains the lambda calculus.)

Together, these two features give great flexibility to the language; for instance, a program can redefine a
function to add new functionality or erase a function to create a secure environment when running a piece
of untrusted code (such as code received through a network). More importantly, these features allow us
to apply in Lua many powerful programming techniques from the functional-language world. Even if you
have no interest at all in functional programming, it is worth learning a little about how to explore these
techniques, because they can make your programs smaller and simpler.

Functions as First-Class Values


As we just saw, functions in Lua are first-class values. The following example illustrates what that means:

a = {p = print} -- 'a.p' refers to the 'print' function


a.p("Hello World") --> Hello World
print = math.sin -- 'print' now refers to the sine function
a.p(print(1)) --> 0.8414709848079
math.sin = a.p -- 'sin' now refers to the print function
math.sin(10, 20) --> 10 20

If functions are values, are there expressions that create functions? Sure. In fact, the usual way to write
a function in Lua, such as

function foo (x) return 2*x end

is an instance of what we call syntactic sugar; it is simply a pretty way to write the following code:

foo = function (x) return 2*x end

The expression in the right-hand side of the assignment (function (x) body end) is a function
constructor, in the same way that {} is a table constructor. Therefore, a function definition is in fact a
statement that creates a value of type "function" and assigns it to a variable.

Note that, in Lua, all functions are anonymous. Like any other value, they do not have names. When
we talk about a function name, such as print, we are actually talking about a variable that holds that
function. Although we often assign functions to global variables, giving them something like a name, there
are several occasions when functions remain anonymous. Let us see some examples.

The table library provides the function table.sort, which receives a table and sorts its elements. Such
a function must allow unlimited variations in the sort order: ascending or descending, numeric or alpha-
betical, tables sorted by a key, and so on. Instead of trying to provide all kinds of options, sort provides
a single optional parameter, which is the order function: a function that takes two elements and returns

68
Closures

whether the first must come before the second in the sorted list. For instance, suppose we have a table
of records like this:

network = {
{name = "grauna", IP = "210.26.30.34"},
{name = "arraial", IP = "210.26.30.23"},
{name = "lua", IP = "210.26.23.12"},
{name = "derain", IP = "210.26.23.20"},
}

If we want to sort the table by the field name, in reverse alphabetical order, we just write this:

table.sort(network, function (a,b) return (a.name > b.name) end)

See how handy the anonymous function is in this statement.

A function that takes another function as an argument, such as sort, is what we call a higher-order func-
tion. Higher-order functions are a powerful programming mechanism, and the use of anonymous functions
to create their function arguments is a great source of flexibility. Nevertheless, remember that higher-order
functions have no special rights; they are a direct consequence of the fact that Lua handles functions as
first-class values.

To further illustrate the use of higher-order functions, we will write a naive implementation of a common
higher-order function, the derivative. In an informal definition, the derivative of a function f is the function
f'(x) = (f(x + d) - f(x)) / d when d becomes infinitesimally small. According to this definition, we can
compute an approximation of the derivative as follows:

function derivative (f, delta)


delta = delta or 1e-4
return function (x)
return (f(x + delta) - f(x))/delta
end
end

Given a function f, the call derivative(f) returns (an approximation of) its derivative, which is
another function:

c = derivative(math.sin)
> print(math.cos(5.2), c(5.2))
--> 0.46851667130038 0.46856084325086
print(math.cos(10), c(10))
--> -0.83907152907645 -0.83904432662041

Non-Global Functions
An obvious consequence of first-class functions is that we can store functions not only in global variables,
but also in table fields and in local variables.

We have already seen several examples of functions in table fields: most Lua libraries use this mechanism
(e.g., io.read, math.sin). To create such functions in Lua, we only have to put together what we
have learned so far:

Lib = {}

69
Closures

Lib.foo = function (x,y) return x + y end


Lib.goo = function (x,y) return x - y end

print(Lib.foo(2, 3), Lib.goo(2, 3)) --> 5 -1

Of course, we can also use constructors:

Lib = {
foo = function (x,y) return x + y end,
goo = function (x,y) return x - y end
}

Moreover, Lua offers a specific syntax to define such functions:

Lib = {}
function Lib.foo (x,y) return x + y end
function Lib.goo (x,y) return x - y end

As we will see in Chapter 21, Object-Oriented Programming, the use of functions in table fields is a key
ingredient for object-oriented programming in Lua.

When we store a function into a local variable, we get a local function, that is, a function that is restricted
to a given scope. Such definitions are particularly useful for packages: because Lua handles each chunk as
a function, a chunk can declare local functions, which are visible only inside the chunk. Lexical scoping
ensures that other functions in the chunk can use these local functions.

Lua supports such uses of local functions with a syntactic sugar for them:

local function f (params)


body
end

A subtle point arises in the definition of recursive local functions, because the naive approach does not
work here. Consider the next definition:

local fact = function (n)


if n == 0 then return 1
else return n*fact(n-1) -- buggy
end
end

When Lua compiles the call fact(n - 1) in the function body, the local fact is not yet defined.
Therefore, this expression will try to call a global fact, not the local one. We can solve this problem by
first defining the local variable and then the function:

local fact
fact = function (n)
if n == 0 then return 1
else return n*fact(n-1)
end
end

Now the fact inside the function refers to the local variable. Its value when the function is defined does
not matter; by the time the function executes, fact already has the right value.

70
Closures

When Lua expands its syntactic sugar for local functions, it does not use the naive definition. Instead, a
definition like

local function foo (params) body end

expands to

local foo; foo = function (params) body end

Therefore, we can use this syntax for recursive functions without worrying.

Of course, this trick does not work if we have indirect recursive functions. In such cases, we must use the
equivalent of an explicit forward declaration:

local f -- "forward" declaration

local function g ()
some code f() some code
end

function f ()
some code g() some code
end

Beware not to write local in the last definition. Otherwise, Lua would create a fresh local variable f,
leaving the original f (the one that g is bound to) undefined.

Lexical Scoping
When we write a function enclosed in another function, it has full access to local variables from the en-
closing function; we call this feature lexical scoping. Although this visibility rule may sound obvious, it
is not. Lexical scoping plus nested first-class functions give great power to a programming language, but
many do not support the combination.

Let us start with a simple example. Suppose we have a list of student names and a table that maps names
to grades; we want to sort the list of names according to their grades, with higher grades first. We can
do this task as follows:

names = {"Peter", "Paul", "Mary"}


grades = {Mary = 10, Paul = 7, Peter = 8}
table.sort(names, function (n1, n2)
return grades[n1] > grades[n2] -- compare the grades
end)

Now, suppose we want to create a function to do this task:

function sortbygrade (names, grades)


table.sort(names, function (n1, n2)
return grades[n1] > grades[n2] -- compare the grades
end)
end

The interesting point in this last example is that the anonymous function given to sort accesses grades,
which is a parameter to the enclosing function sortbygrade. Inside this anonymous function, grades

71
Closures

is neither a global variable nor a local variable, but what we call a non-local variable. (For historical
reasons, non-local variables are also called upvalues in Lua.)

Why is this point so interesting? Because functions, being first-class values, can escape the original scope
of their variables. Consider the following code:

function newCounter ()
local count = 0
return function () -- anonymous function
count = count + 1
return count
end
end

c1 = newCounter()
print(c1()) --> 1
print(c1()) --> 2

In this code, the anonymous function refers to a non-local variable (count) to keep its counter. However,
by the time we call the anonymous function, the variable count seems to be out of scope, because the
function that created this variable (newCounter) has already returned. Nevertheless, Lua handles this
situation correctly, using the concept of closure. Simply put, a closure is a function plus all it needs to
access non-local variables correctly. If we call newCounter again, it will create a new local variable
count and a new closure, acting over this new variable:

c2 = newCounter()
print(c2()) --> 1
print(c1()) --> 3
print(c2()) --> 2

So, c1 and c2 are different closures. Both are built over the same function, but each acts upon an inde-
pendent instantiation of the local variable count.

Technically speaking, what is a value in Lua is the closure, not the function. The function itself is a kind
of a prototype for closures. Nevertheless, we will continue to use the term “function” to refer to a closure
whenever there is no possibility for confusion.

Closures provide a valuable tool in many contexts. As we have seen, they are useful as arguments to high-
er-order functions such as sort. Closures are valuable for functions that build other functions too, like
our newCounter example or the derivative example; this mechanism allows Lua programs to incorpo-
rate sophisticated programming techniques from the functional world. Closures are useful for callback
functions, too. A typical example here occurs when we create buttons in a conventional GUI toolkit. Each
button has a callback function to be called when the user presses the button; we want different buttons to
do slightly different things when pressed.

For instance, a digital calculator needs ten similar buttons, one for each digit. We can create each of them
with a function like this:

function digitButton (digit)


return Button{ label = tostring(digit),
action = function ()
add_to_display(digit)
end
}

72
Closures

end

In this example, we pretend that Button is a toolkit function that creates new buttons; label is the
button label; and action is the callback function to be called when the button is pressed. The callback
can be called a long time after digitButton did its task, but it can still access the digit variable.

Closures are valuable also in a quite different context. Because functions are stored in regular variables,
we can easily redefine functions in Lua, even predefined functions. This facility is one of the reasons
why Lua is so flexible. Frequently, when we redefine a function, we need the original function in the
new implementation. As an example, suppose we want to redefine the function sin to operate in degrees
instead of radians. This new function converts its argument and then calls the original function sin to do
the real work. Our code could look like this:

local oldSin = math.sin


math.sin = function (x)
return oldSin(x * (math.pi / 180))
end

A slightly cleaner way to do this redefinition is as follows:

do
local oldSin = math.sin
local k = math.pi / 180
math.sin = function (x)
return oldSin(x * k)
end
end

This code uses a do block to limit the scope of the local variable oldSin; following conventional visibility
rules, the variable is only visible inside the block. So, the only way to access it is through the new function.

We can use this same technique to create secure environments, also called sandboxes. Secure environments
are essential when running untrusted code, such as code received through the Internet by a server. For
instance, to restrict the files that a program can access, we can redefine io.open using closures:

do
local oldOpen = io.open
local access_OK = function (filename, mode)
check access
end
io.open = function (filename, mode)
if access_OK(filename, mode) then
return oldOpen(filename, mode)
else
return nil, "access denied"
end
end
end

What makes this example nice is that, after this redefinition, there is no way for the program to call the
unrestricted version of function io.open except through the new, restricted version. It keeps the insecure
version as a private variable in a closure, inaccessible from the outside. With this technique, we can build
Lua sandboxes in Lua itself, with the usual benefits: simplicity and flexibility. Instead of a one-size-fits-all
solution, Lua offers us a meta-mechanism, so that we can tailor our environment for our specific security
needs. (Real sandboxes do more than protecting external files. We will return to this subject in the section
called “Sandboxing”.)

73
Closures

A Taste of Functional Programming


To give a more concrete example of functional programming, in this section we will develop a simple
system for geometric regions.1 The goal is to develop a system to represent geometric regions, where a
region is a set of points. We want to be able to represent all kinds of shapes and to combine and modify
shapes in several ways (rotation, translation, union, etc.).

To implement this system, we could start looking for good data structures to represent shapes; we could try
an object-oriented approach and develop some hierarchy of shapes. Or we can work on a higher level of
abstraction and represent our sets directly by their characteristic (or indicator) function. (The characteristic
function of a set A is a function fA such that fA(x) is true if and only if x belongs to A.) Given that a geometric
region is a set of points, we can represent a region by its characteristic function; that is, we represent a
region by a function that, given a point, returns true if and only if the point belongs to the region.

As an example, the next function represents a disk (a circular region) with center (1.0, 3.0) and radius 4.5:

function disk1 (x, y)


return (x - 1.0)^2 + (y - 3.0)^2 <= 4.5^2
end

With higher-order functions and lexical scoping, it is easy to define a disk factory, which creates disks
with given centers and radius:

function disk (cx, cy, r)


return function (x, y)
return (x - cx)^2 + (y - cy)^2 <= r^2
end
end

A call like disk(1.0, 3.0, 4.5) will create a disk equal to disk1.

The next function creates axis-aligned rectangles, given the bounds:

function rect (left, right, bottom, up)


return function (x, y)
return left <= x and x <= right and
bottom <= x and x <= up
end
end

In a similar fashion, we can define functions to create other basic shapes, such as triangles and non–
axis-aligned rectangles. Each shape has a completely independent implementation, needing only a correct
characteristic function.

Now let us see how to modify and combine regions. To create the complement of any region is trivial:

function complement (r)


return function (x, y)
return not r(x, y)
end
end
1
This example is adapted from the research report Haskell vs. Ada vs. C++ vs. Awk vs. ... An Experiment in Software Prototyping Productivity,
by Paul Hudak and Mark P. Jones.

74
Closures

Union, intersection, and difference are equally simple, as we show in Figure 9.1, “Union, intersection, and
difference of regions”.

Figure 9.1. Union, intersection, and difference of regions


function union (r1, r2)
return function (x, y)
return r1(x, y) or r2(x, y)
end
end

function intersection (r1, r2)


return function (x, y)
return r1(x, y) and r2(x, y)
end
end

function difference (r1, r2)


return function (x, y)
return r1(x, y) and not r2(x, y)
end
end

The following function translates a region by a given delta:

function translate (r, dx, dy)


return function (x, y)
return r(x - dx, y - dy)
end
end

To visualize a region, we can traverse the viewport testing each pixel; pixels in the region are painted
black, pixels outside it are painted white. To illustrate this process in a simple way, we will write a function
to generate a PBM (portable bitmap) file with the drawing of a given region.

PBM files have a quite simple structure. (This structure is also highly inefficient, but our emphasis here
is simplicity.) In its text-mode variant, it starts with a one-line header with the string "P1"; then there is
one line with the width and height of the drawing, in pixels. Finally, there is a sequence of digits, one for
each image pixel (1 for black, 0 for white), separated by optional spaces and end of lines. The function
plot in Figure 9.2, “Drawing a region in a PBM file” creates a PBM file for a given region, mapping a
virtual drawing area (-1,1], [-1,1) to the viewport area [1,M], [1,N].

Figure 9.2. Drawing a region in a PBM file


function plot (r, M, N)
io.write("P1\n", M, " ", N, "\n") -- header
for i = 1, N do -- for each line
local y = (N - i*2)/N
for j = 1, M do -- for each column
local x = (j*2 - M)/M
io.write(r(x, y) and "1" or "0")
end
io.write("\n")
end
end

75
Closures

To complete our example, the following command draws a waxing crescent moon (as seen from the South-
ern Hemisphere):

c1 = disk(0, 0, 1)
plot(difference(c1, translate(c1, 0.3, 0)), 500, 500)

Exercises
Exercise 9.1: Write a function integral that takes a function f and returns an approximation of its
integral.

Exercise 9.2: What will be the output of the following chunk:

function F (x)
return {
set = function (y) x = y end,
get = function () return x end
}
end

o1 = F(10)
o2 = F(20)
print(o1.get(), o2.get())
o2.set(100)
o1.set(300)
print(o1.get(), o2.get())

Exercise 9.3: Exercise 5.4 asked you to write a function that receives a polynomial (represented as a table)
and a value for its variable, and returns the polynomial value. Write the curried version of that function.
Your function should receive a polynomial and return a function that, when called with a value for x,
returns the value of the polynomial for that x. See the example:

f = newpoly({3, 0, 1})
print(f(0)) --> 3
print(f(5)) --> 28
print(f(10)) --> 103

Exercise 9.4: Using our system for geometric regions, draw a waxing crescent moon as seen from the
Northern Hemisphere.

Exercise 9.5: In our system for geometric regions, add a function to rotate a given region by a given angle.

76
Chapter 10. Pattern Matching
Unlike several other scripting languages, Lua uses neither POSIX regex nor Perl regular expressions for
pattern matching. The main reason for this decision is size: a typical implementation of POSIX regular
expressions takes more than 4000 lines of code, which is more than half the size of all Lua standard
libraries together. In comparison, the implementation of pattern matching in Lua has less than 600 lines.
Of course, pattern matching in Lua cannot do all that a full POSIX implementation does. Nevertheless,
pattern matching in Lua is a powerful tool, and includes some features that are difficult to match with
standard POSIX implementations.

The Pattern-Matching Functions


The string library offers four functions based on patterns. We have already had a glimpse at find and
gsub; the other two are match and gmatch (Global Match). Now we will see all of them in detail.

The function string.find


The function string.find searches for a pattern inside a given subject string. The simplest form of
a pattern is a word, which matches only a copy of itself. For instance, the pattern 'hello' will search
for the substring "hello" inside the subject string. When string.find finds its pattern, it returns
two values: the index where the match begins and the index where the match ends. If it does not find a
match, it returns nil:

s = "hello world"
i, j = string.find(s, "hello")
print(i, j) --> 1 5
print(string.sub(s, i, j)) --> hello
print(string.find(s, "world")) --> 7 11
i, j = string.find(s, "l")
print(i, j) --> 3 3
print(string.find(s, "lll")) --> nil

When a match succeeds, we can call string.sub with the values returned by find to get the part of
the subject string that matched the pattern. For simple patterns, this is necessarily the pattern itself.

The function string.find has two optional parameters. The third parameter is an index that tells where
in the subject string to start the search. The fourth parameter, a Boolean, indicates a plain search. A plain
search, as the name implies, does a plain “find substring” search in the subject, ignoring patterns:

> string.find("a [word]", "[")


stdin:1: malformed pattern (missing ']')
> string.find("a [word]", "[", 1, true) --> 3 3

In the first call, the function complains because '[' has a special meaning in patterns. In the second call,
the function treats '[' as a simple string. Note that we cannot pass the fourth optional parameter without
the third one.

The function string.match


The function string.match is similar to find, in the sense that it also searches for a pattern in a string.
However, instead of returning the positions where it found the pattern, it returns the part of the subject
string that matched the pattern:

77
Pattern Matching

print(string.match("hello world", "hello")) --> hello

For fixed patterns such as 'hello', this function is pointless. It shows its power when used with variable
patterns, as in the next example:

date = "Today is 17/7/1990"


d = string.match(date, "%d+/%d+/%d+")
print(d) --> 17/7/1990

Shortly we will discuss the meaning of the pattern '%d+/%d+/%d+' and more advanced uses for
string.match.

The function string.gsub


The function string.gsub has three mandatory parameters: a subject string, a pattern, and a replace-
ment string. Its basic use is to substitute the replacement string for all occurrences of the pattern inside
the subject string:

s = string.gsub("Lua is cute", "cute", "great")


print(s) --> Lua is great
s = string.gsub("all lii", "l", "x")
print(s) --> axx xii
s = string.gsub("Lua is great", "Sol", "Sun")
print(s) --> Lua is great

An optional fourth parameter limits the number of substitutions to be made:

s = string.gsub("all lii", "l", "x", 1)


print(s) --> axl lii
s = string.gsub("all lii", "l", "x", 2)
print(s) --> axx lii

Instead of a replacement string, the third argument to string.gsub can be also a function or a table,
which is called (or indexed) to produce the replacement string; we will cover this feature in the section
called “Replacements”.

The function string.gsub also returns as a second result the number of times it made the substitution.

The function string.gmatch


The function string.gmatch returns a function that iterates over all occurrences of a pattern in a string.
For instance, the following example collects all words of a given string s:

s = "some string"
words = {}
for w in string.gmatch(s, "%a+") do
words[#words + 1] = w
end

As we will discuss shortly, the pattern '%a+' matches sequences of one or more alphabetic characters (that
is, words). So, the for loop will iterate over all words of the subject string, storing them in the list words.

Patterns
Most pattern-matching libraries use the backslash as an escape. However, this choice has some annoying
consequences. For the Lua parser, patterns are regular strings. They have no special treatment and follow

78
Pattern Matching

the same rules as other strings. Only the pattern-matching functions interpret them as patterns. Because
the backslash is the escape character in Lua, we have to escape it to pass it to any function. Patterns are
naturally hard to read, and writing "\\" instead of "\" everywhere does not help.

We could ameliorate this problem with long strings, enclosing patterns between double brackets. (Some
languages recommend this practice.) However, the long-string notation seems cumbersome for patterns,
which are usually short. Moreover, we would lose the ability to use escapes inside patterns. (Some pat-
tern-matching tools work around this limitation by reimplementing the usual string escapes.)

Lua's solution is simpler: patterns in Lua use the percent sign as an escape. (Several functions in C, such
as printf and strftime, adopt the same solution.) In general, any escaped alphanumeric character
has some special meaning (e.g., '%a' matches any letter), while any escaped non-alphanumeric character
represents itself (e.g., '%.' matches a dot).

We will start our discussion about patterns with character classes. A character class is an item in a pattern
that can match any character in a specific set. For instance, the class %d matches any digit. Therefore, we
can search for a date in the format dd/mm/yyyy with the pattern '%d%d/%d%d/%d%d%d%d':

s = "Deadline is 30/05/1999, firm"


date = "%d%d/%d%d/%d%d%d%d"
print(string.match(s, date)) --> 30/05/1999

The following table lists the predefined character classes with their meanings:

. all characters
%a letters
%c control characters
%d digits
%g printable characters except spaces
%l lower-case letters
%p punctuation characters
%s space characters
%u upper-case letters
%w alphanumeric characters
%x hexadecimal digits

An upper-case version of any of these classes represents the complement of the class. For instance, '%A'
represents all non-letter characters:

print((string.gsub("hello, up-down!", "%A", ".")))


--> hello..up.down.

(When printing the results of gsub, I am using extra parentheses to discard its second result, which is
the number of substitutions.)

Some characters, called magic characters, have special meanings when used in a pattern. Patterns in Lua
use the following magic characters:

( ) . % + - * ? [ ] ^ $

As we have seen, the percent sign works as an escape for these magic characters. So, '%?' matches a
question mark and '%%' matches a percent sign itself. We can escape not only the magic characters, but
also any non-alphanumeric character. When in doubt, play safe and use an escape.

79
Pattern Matching

A char-set allows us to create our own character classes, grouping single characters and classes inside
square brackets. For instance, the char-set '[%w_]' matches both alphanumeric characters and underscores,
'[01]' matches binary digits, and '[%[%]]' matches square brackets. To count the number of vowels in
a text, we can write this code:

_, nvow = string.gsub(text, "[AEIOUaeiou]", "")

We can also include character ranges in a char-set, by writing the first and the last characters of the range
separated by a hyphen. I seldom use this feature, because most useful ranges are predefined; for instance,
'%d' substitutes '[0-9]', and '%x' substitutes '[0-9a-fA-F]'. However, if you need to find an octal digit,
you may prefer '[0-7]' instead of an explicit enumeration like '[01234567]'.

We can get the complement of any char-set by starting it with a caret: the pattern '[^0-7]' finds any
character that is not an octal digit and '[^\n]' matches any character different from newline. Nevertheless,
remember that you can negate simple classes with its upper-case version: '%S' is simpler than '[^%s]'.

We can make patterns still more useful with modifiers for repetitions and optional parts. Patterns in Lua
offer four modifiers:

+ 1 or more repetitions
* 0 or more repetitions
- 0 or more lazy repetitions
? optional (0 or 1 occurrence)

The plus modifier matches one or more characters of the original class. It will always get the longest
sequence that matches the pattern. For instance, the pattern '%a+' means one or more letters (a word):

print((string.gsub("one, and two; and three", "%a+", "word")))


--> word, word word; word word

The pattern '%d+' matches one or more digits (an integer numeral):

print(string.match("the number 1298 is even", "%d+")) --> 1298

The asterisk modifier is similar to plus, but it also accepts zero occurrences of characters of the class.
A typical use is to match optional spaces between parts of a pattern. For instance, to match an empty
parenthesis pair, such as () or ( ), we can use the pattern '%(%s*%)', where the pattern '%s*' matches
zero or more spaces. (Parentheses have a special meaning in a pattern, so we must escape them.) As another
example, the pattern '[_%a][_%w]*' matches identifiers in a Lua program: a sequence starting with a
letter or an underscore, followed by zero or more underscores or alphanumeric characters.

Like an asterisk, the minus modifier also matches zero or more occurrences of characters of the original
class. However, instead of matching the longest sequence, it matches the shortest one. Sometimes there
is no difference between asterisk and minus, but usually they give rather different results. For instance, if
we try to find an identifier with the pattern '[_%a][_%w]-', we will find only the first letter, because the
'[_%w]-' will always match the empty sequence. On the other hand, suppose we want to erase comments
in a C program. Many people would first try '/%*.*%*/' (that is, a "/*" followed by a sequence of any
characters followed by "*/", written with the appropriate escapes). However, because the '.*' expands
as far as it can, the first "/*" in the program would close only with the last "*/":

test = "int x; /* x */ int y; /* y */"


print((string.gsub(test, "/%*.*%*/", "")))
--> int x;

The pattern '.-', instead, will expand only as much as necessary to find the first "*/", so that we get
the desired result:

80
Pattern Matching

test = "int x; /* x */ int y; /* y */"


print((string.gsub(test, "/%*.-%*/", "")))
--> int x; int y;

The last modifier, the question mark, matches an optional character. As an example, suppose we want to
find an integer in a text, where the number can contain an optional sign. The pattern '[+-]?%d+' does the
job, matching numerals like "-12", "23", and "+1009". The character class '[+-]' matches either a
plus or a minus sign; the following ? makes this sign optional.

Unlike some other systems, in Lua we can apply a modifier only to a character class; there is no way to
group patterns under a modifier. For instance, there is no pattern that matches an optional word (unless
the word has only one letter). Usually, we can circumvent this limitation using some of the advanced
techniques that we will see in the end of this chapter.

If a pattern begins with a caret, it will match only at the beginning of the subject string. Similarly, if it
ends with a dollar sign, it will match only at the end of the subject string. We can use these marks both
to restrict the matches that we find and to anchor patterns. For instance, the next test checks whether the
string s starts with a digit:

if string.find(s, "^%d") then ...

The next one checks whether that string represents an integer number, without any other leading or trailing
characters:

if string.find(s, "^[+-]?%d+$") then ...

The caret and dollar signs are magic only when used in the beginning or end of the pattern. Otherwise,
they act as regular characters matching themselves.

Another item in a pattern is '%b', which matches balanced strings. We write this item as '%bxy', where
x and y are any two distinct characters; the x acts as an opening character and the y as the closing one.
For instance, the pattern '%b()' matches parts of the string that start with a left parenthesis and finish at
the respective right one:

s = "a (enclosed (in) parentheses) line"


print((string.gsub(s, "%b()", ""))) --> a line

Typically, we use this pattern as '%b()', '%b[]', '%b{}', or '%b<>', but we can use any two distinct char-
acters as delimiters.

Finally, the item '%f[char-set]' represents a frontier pattern. It matches an empty string only if the
next character is in char-set but the previous one is not:

s = "the anthem is the theme"


print((string.gsub(s, "%f[%w]the%f[%W]", "one")))
--> one anthem is one theme

The pattern '%f[%w]' matches a frontier between a non-alphanumeric and an alphanumeric character,
and the pattern '%f[%W]' matches a frontier between an alphanumeric and a non-alphanumeric character.
Therefore, the given pattern matches the string "the" only as an entire word. Note that we must write
the char-set inside brackets, even when it is a single class.

The frontier pattern treats the positions before the first and after the last characters in the subject string as
if they had the null character (ASCII code zero). In the previous example, the first "the" starts with a
frontier between a null character, which is not in the set '[%w]', and a t, which is.

81
Pattern Matching

Captures
The capture mechanism allows a pattern to yank parts of the subject string that match parts of the pattern
for further use. We specify a capture by writing the parts of the pattern that we want to capture between
parentheses.

When a pattern has captures, the function string.match returns each captured value as a separate
result; in other words, it breaks a string into its captured parts.

pair = "name = Anna"


key, value = string.match(pair, "(%a+)%s*=%s*(%a+)")
print(key, value) --> name Anna

The pattern '%a+' specifies a non-empty sequence of letters; the pattern '%s*' specifies a possibly empty
sequence of spaces. So, in the example above, the whole pattern specifies a sequence of letters, followed
by a sequence of spaces, followed by an equals sign, again followed by spaces, plus another sequence of
letters. Both sequences of letters have their patterns enclosed in parentheses, so that they will be captured
if a match occurs. Below is a similar example:

date = "Today is 17/7/1990"


d, m, y = string.match(date, "(%d+)/(%d+)/(%d+)")
print(d, m, y) --> 17 7 1990

In this example, we use three captures, one for each sequence of digits.

In a pattern, an item like '%n', where n is a single digit, matches only a copy of the n-th capture. As a
typical use, suppose we want to find, inside a string, a substring enclosed between single or double quotes.
We could try a pattern such as '["'].-["']', that is, a quote followed by anything followed by another
quote; but we would have problems with strings like "it's all right". To solve this problem, we
can capture the first quote and use it to specify the second one:

s = [[then he said: "it's all right"!]]


q, quotedPart = string.match(s, "([\"'])(.-)%1")
print(quotedPart) --> it's all right
print(q) --> "

The first capture is the quote character itself and the second capture is the contents of the quote (the
substring matching the '.-').

A similar example is this pattern, which matches long strings in Lua:

%[(=*)%[(.-)%]%1%]

It will match an opening square bracket followed by zero or more equals signs, followed by another opening
square bracket, followed by anything (the string content), followed by a closing square bracket, followed
by the same number of equals signs, followed by another closing square bracket:

p = "%[(=*)%[(.-)%]%1%]"
s = "a = [=[[[ something ]] ]==] ]=]; print(a)"
print(string.match(s, p)) --> = [[ something ]] ]==]

The first capture is the sequence of equals signs (only one sign in this example); the second is the string
content.

The third use of captured values is in the replacement string of gsub. Like the pattern, the replacement
string can also contain items like "%n", which are changed to the respective captures when the substitu-

82
Pattern Matching

tion is made. In particular, the item "%0" becomes the whole match. (By the way, a percent sign in the
replacement string must be escaped as "%%".) As an example, the following command duplicates every
letter in a string, with a hyphen between the copies:

print((string.gsub("hello Lua!", "%a", "%0-%0")))


--> h-he-el-ll-lo-o L-Lu-ua-a!

This one interchanges adjacent characters:

print((string.gsub("hello Lua", "(.)(.)", "%2%1")))


--> ehll ouLa

As a more useful example, let us write a primitive format converter, which gets a string with commands
written in a LaTeX style and changes them to a format in XML style:

\command{some text} --> <command>some text</command>

If we disallow nested commands, the following call to string.gsub does the job:

s = [[the \quote{task} is to \em{change} that.]]


s = string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>")
print(s)
--> the <quote>task</quote> is to <em>change</em> that.

(In the next section, we will see how to handle nested commands.)

Another useful example is how to trim a string:

function trim (s)


s = string.gsub(s, "^%s*(.-)%s*$", "%1")
return s
end

Note the judicious use of pattern modifiers. The two anchors (^ and $) ensure that we get the whole string.
Because the '.-' in the middle tries to expand as little as possible, the two enclosing patterns '%s*' match
all spaces at both extremities.

Replacements
As we have seen already, we can use either a function or a table as the third argument to string.gsub,
instead of a string. When invoked with a function, string.gsub calls the function every time it finds
a match; the arguments to each call are the captures, and the value that the function returns becomes the
replacement string. When invoked with a table, string.gsub looks up the table using the first capture
as the key, and the associated value is used as the replacement string. If the result from the call or from
the table lookup is nil, gsub does not change the match.

As a first example, the following function does variable expansion: it substitutes the value of the global
variable varname for every occurrence of $varname in a string:

function expand (s)


return (string.gsub(s, "$(%w+)", _G))
end

name = "Lua"; status = "great"


print(expand("$name is $status, isn't it?"))
--> Lua is great, isn't it?

83
Pattern Matching

(As we will discuss in detail in Chapter 22, The Environment, _G is a predefined table containing all global
variables.) For each match with '$(%w+)' (a dollar sign followed by a name), gsub looks up the captured
name in the global table _G; the result replaces the match. When the table does not have the key, there
is no replacement:

print(expand("$othername is $status, isn't it?"))


--> $othername is great, isn't it?

If we are not sure whether the given variables have string values, we may want to apply tostring to
their values. In this case, we can use a function as the replacement value:

function expand (s)


return (string.gsub(s, "$(%w+)", function (n)
return tostring(_G[n])
end))
end

print(expand("print = $print; a = $a"))


--> print = function: 0x8050ce0; a = nil

Inside expand, for each match with '$(%w+)', gsub calls the given function with the captured name as
argument; the returned string replaces the match.

The last example goes back to our format converter from the previous section. Again, we want to convert
commands in LaTeX style (\example{text}) to XML style (<example>text</example>), but
allowing nested commands this time. The following function uses recursion to do the job:

function toxml (s)


s = string.gsub(s, "\\(%a+)(%b{})", function (tag, body)
body = string.sub(body, 2, -2) -- remove the brackets
body = toxml(body) -- handle nested commands
return string.format("<%s>%s</%s>", tag, body, tag)
end)
return s
end

print(toxml("\\title{The \\bold{big} example}"))


--> <title>The <bold>big</bold> example</title>

URL encoding
Our next example will use URL encoding, which is the encoding used by HTTP to send parameters em-
bedded in a URL. This encoding represents special characters (such as =, &, and +) as "%xx", where xx
is the character code in hexadecimal. After that, it changes spaces to plus signs. For instance, it encodes
the string "a+b = c" as "a%2Bb+%3D+c". Finally, it writes each parameter name and parameter value
with an equals sign in between and appends all resulting pairs name = value with an ampersand in
between. For instance, the values

name = "al"; query = "a+b = c"; q="yes or no"

are encoded as "name=al&query=a%2Bb+%3D+c&q=yes+or+no".

Now, suppose we want to decode this URL and store each value in a table, indexed by its corresponding
name. The following function does the basic decoding:

function unescape (s)

84
Pattern Matching

s = string.gsub(s, "+", " ")


s = string.gsub(s, "%%(%x%x)", function (h)
return string.char(tonumber(h, 16))
end)
return s
end

print(unescape("a%2Bb+%3D+c")) --> a+b = c

The first gsub changes each plus sign in the string to a space. The second gsub matches all two-digit
hexadecimal numerals preceded by a percent sign and calls an anonymous function for each match. This
function converts the hexadecimal numeral into a number (using tonumber with base 16) and returns
the corresponding character (string.char).

To decode the pairs name=value, we use gmatch. Because neither names nor values can contain either
ampersands or equals signs, we can match them with the pattern '[^&=]+':

cgi = {}
function decode (s)
for name, value in string.gmatch(s, "([^&=]+)=([^&=]+)") do
name = unescape(name)
value = unescape(value)
cgi[name] = value
end
end

The call to gmatch matches all pairs in the form name=value. For each pair, the iterator returns the
corresponding captures (as marked by the parentheses in the matching string) as the values for name and
value. The loop body simply applies unescape to both strings and stores the pair in the cgi table.

The corresponding encoding is also easy to write. First, we write the escape function; this function
encodes all special characters as a percent sign followed by the character code in hexadecimal (the format
option "%02X" makes a hexadecimal number with two digits, using 0 for padding), and then changes
spaces to plus signs:

function escape (s)


s = string.gsub(s, "[&=+%%%c]", function (c)
return string.format("%%%02X", string.byte(c))
end)
s = string.gsub(s, " ", "+")
return s
end

The encode function traverses the table to be encoded, building the resulting string:

function encode (t)


local b = {}
for k,v in pairs(t) do
b[#b + 1] = (escape(k) .. "=" .. escape(v))
end
-- concatenates all entries in 'b', separated by "&"
return table.concat(b, "&")
end

t = {name = "al", query = "a+b = c", q = "yes or no"}


print(encode(t)) --> q=yes+or+no&query=a%2Bb+%3D+c&name=al

85
Pattern Matching

Tab expansion
An empty capture like '()' has a special meaning in Lua. Instead of capturing nothing (a useless task), this
pattern captures its position in the subject string, as a number:

print(string.match("hello", "()ll()")) --> 3 5

(Note that the result of this example is not the same as what we get from string.find, because the
position of the second empty capture is after the match.)

A nice example of the use of position captures is for expanding tabs in a string:

function expandTabs (s, tab)


tab = tab or 8 -- tab "size" (default is 8)
local corr = 0 -- correction
s = string.gsub(s, "()\t", function (p)
local sp = tab - (p - 1 + corr)%tab
corr = corr - 1 + sp
return string.rep(" ", sp)
end)
return s
end

The gsub pattern matches all tabs in the string, capturing their positions. For each tab, the anonymous
function uses this position to compute the number of spaces needed to arrive at a column that is a multiple
of tab: it subtracts one from the position to make it relative to zero and adds corr to compensate for
previous tabs. (The expansion of each tab affects the position of the following ones.) It then updates the
correction for the next tab: minus one for the tab being removed, plus sp for the spaces being added.
Finally, it returns a string with the appropriate number of spaces to replace the tab.

Just for completeness, let us see how to reverse this operation, converting spaces to tabs. A first approach
could also involve the use of empty captures to manipulate positions, but there is a simpler solution: at
every eighth character, we insert a mark in the string. Then, wherever the mark is preceded by spaces, we
replace the sequence spaces–mark by a tab:

function unexpandTabs (s, tab)


tab = tab or 8
s = expandTabs(s, tab)
local pat = string.rep(".", tab)
s = string.gsub(s, pat, "%0\1")
s = string.gsub(s, " +\1", "\t")
s = string.gsub(s, "\1", "")
return s
end

The function starts by expanding the string to remove any previous tabs. Then it computes an auxiliary
pattern for matching all sequences of eight characters, and uses this pattern to add a mark (the control
character \1) after every eight characters. It then substitutes a tab for all sequences of one or more spaces
followed by a mark. Finally, it removes the marks left (those not preceded by spaces).

Tricks of the Trade


Pattern matching is a powerful tool for manipulating strings. We can perform many complex operations
with only a few calls to string.gsub. However, as with any power, we must use it carefully.

86
Pattern Matching

Pattern matching is not a replacement for a proper parser. For quick-and-dirty programs, we can do useful
manipulations on source code, but it may be hard to build a product with quality. As a good example,
consider the pattern we used to match comments in a C program: '/%*.-%*/'. If the program has a literal
string containing "/*", we may get a wrong result:

test = [[char s[] = "a /* here"; /* a tricky string */]]


print((string.gsub(test, "/%*.-%*/", "<COMMENT>")))
--> char s[] = "a <COMMENT>

Strings with such contents are rare. For our own use, that pattern will probably do its job, but we should
not distribute a program with such a flaw.

Usually, pattern matching is efficient enough for Lua programs: my new machine takes less than 0.2
seconds to count all words in a 4.4 MB text (850 K-words).1 But we can take precautions. We should
always make the pattern as specific as possible; loose patterns are slower than specific ones. An extreme
example is '(.-)%$', to get all text in a string up to the first dollar sign. If the subject string has a dollar
sign, everything goes fine, but suppose that the string does not contain any dollar signs. The algorithm
will first try to match the pattern starting at the first position of the string. It will go through all the string,
looking for a dollar. When the string ends, the pattern fails for the first position of the string. Then, the
algorithm will do the whole search again, starting at the second position of the string, only to discover that
the pattern does not match there, too, repeating the search for every position in the string. This will take a
quadratic time, resulting in more than four minutes in my new machine for a string of 200K characters. We
can correct this problem simply by anchoring the pattern at the first position of the string, with '^(.-)%
$'. The anchor tells the algorithm to stop the search if it cannot find a match at the first position. With the
anchor, the match runs in a hundredth of a second.

Beware also of empty patterns, that is, patterns that match the empty string. For instance, if we try to match
names with a pattern like '%a*', we will find names everywhere:

i, j = string.find(";$% **#$hello13", "%a*")


print(i,j) --> 1 0

In this example, the call to string.find has correctly found an empty sequence of letters at the begin-
ning of the string.

It never makes sense to write a pattern that ends with the minus modifier, because it will match only the
empty string. This modifier always needs something after it to anchor its expansion. Similarly, patterns
that include '.*' are tricky, because this construction can expand much more than we intended.

Sometimes, it is useful to use Lua itself to build a pattern. We already used this trick in our function to
convert spaces to tabs. As another example, let us see how we can find long lines in a text, for instance lines
with more than 70 characters. A long line is a sequence of 70 or more characters different from newline.
We can match a single character different from newline with the character class '[^\n]'. Therefore, we
can match a long line with a pattern that repeats 70 times the pattern for one character, finishing in a
repetition for that pattern (to match the rest of the line). Instead of writing this pattern by hand, we can
create it with string.rep:

pattern = string.rep("[^\n]", 70) .. "+"

As another example, suppose we want to make a case-insensitive search. A way of doing this is to change
any letter x in the pattern to the class '[xX]', that is, a class including both the lower and the upper-case
versions of the original letter. We can automate this conversion with a function:

function nocase (s)


1
“My new machine” is an Intel Core i7-4790 3.6 GHz, with 8 GB of RAM. I measured all performance data in this book on that machine.

87
Pattern Matching

s = string.gsub(s, "%a", function (c)


return "[" .. string.lower(c) .. string.upper(c) .. "]"
end)
return s
end

print(nocase("Hi there!")) --> [hH][iI] [tT][hH][eE][rR][eE]!

Sometimes, we want to replace every plain occurrence of s1 with s2, without regarding any character as
magic. If the strings s1 and s2 are literals, we can add proper escapes to magic characters while we write
the strings. If these strings are variable values, we can use another gsub to put the escapes for us:

s1 = string.gsub(s1, "(%W)", "%%%1")


s2 = string.gsub(s2, "%%", "%%%%")

In the search string, we escape all non-alphanumeric characters (thus the upper-case W). In the replacement
string, we escape only the percent sign.

Another useful technique for pattern matching is to preprocess the subject string before the real work.
Suppose we want to change to upper case all quoted strings in a text, where a quoted string starts and ends
with a double quote ("), but may contain escaped quotes ("\""):

follows a typical string: "This is \"great\"!".

One approach for handling such cases is to preprocess the text to encode the problematic sequence as
something else. For instance, we could code "\"" as "\1". However, if the original text already contains
a "\1", we are in trouble. An easy way to do the encoding and avoid this problem is to code all sequences
"\x" as "\ddd", where ddd is the decimal representation of the character x:

function code (s)


return (string.gsub(s, "\\(.)", function (x)
return string.format("\\%03d", string.byte(x))
end))
end

Now any sequence "\ddd" in the encoded string must have come from the coding, because any "\ddd"
in the original string has been coded, too. So, the decoding is an easy task:

function decode (s)


return (string.gsub(s, "\\(%d%d%d)", function (d)
return "\\" .. string.char(tonumber(d))
end))
end

Now we can complete our task. As the encoded string does not contain any escaped quote ("\""), we can
search for quoted strings simply with '".-"':

s = [[follows a typical string: "This is \"great\"!".]]


s = code(s)
s = string.gsub(s, '".-"', string.upper)
s = decode(s)
print(s) --> follows a typical string: "THIS IS \"GREAT\"!".

We can also write it like here:

print(decode(string.gsub(code(s), '".-"', string.upper)))

88
Pattern Matching

The applicability of pattern-matching functions to UTF-8 strings depends on the pattern. Literal patterns
work without problems, due to the key property of UTF-8 that the encoding of any character never appears
inside the encoding of any other character. Character classes and character sets work only for ASCII
characters. For instance, the pattern '%s' works on UTF-8 strings, but it will match only the ASCII white
spaces; it will not match extra Unicode white spaces such as a non-break space (U+00A0) or a Mongolian
vowel separator (U+180E).

Judicious patterns can bring some extra power to Unicode handling. A good example is the predefined
pattern utf8.charpattern, which matches exactly one UTF-8 character. The utf8 library defines
this pattern as follows:

utf8.charpattern = [\0-\x7F\xC2-\xF4][\x80-\xBF]*

The first part is a class that matches either ASCII characters (range [0, 0x7F]) or initial bytes for multibyte
sequences (range [0xC2, 0xF4]). The second part matches zero or more continuation bytes (range [0x80,
0xBF]).

Exercises
Exercise 10.1: Write a function split that receives a string and a delimiter pattern and returns a sequence
with the chunks in the original string separated by the delimiter:

t = split("a whole new world", " ")


-- t = {"a", "whole", "new", "world"}

How does your function handle empty strings? (In particular, is an empty string an empty sequence or a
sequence with one empty string?)

Exercise 10.2: The patterns '%D' and '[^%d]' are equivalent. What about the patterns '[^%d%u]' and '[%D
%U]'?

Exercise 10.3: Write a function transliterate. This function receives a string and replaces each character
in that string with another character, according to a table given as a second argument. If the table maps
a to b, the function should replace any occurrence of a with b. If the table maps a to false, the function
should remove occurrences of a from the resulting string.

Exercise 10.4: At the end of the section called “Captures”, we defined a trim function. Because of its
use of backtracking, this function can take a quadratic time for some strings. (For instance, in my new
machine, a match for a 100 KB string can take 52 seconds.)

• Create a string that triggers this quadratic behavior in function trim.

• Rewrite that function so that it always works in linear time.

Exercise 10.5: Write a function to format a binary string as a literal in Lua, using the escape sequence
\x for all bytes:

print(escape("\0\1hello\200"))
--> \x00\x01\x68\x65\x6C\x6C\x6F\xC8

As an improved version, use also the escape sequence \z to break long lines.

Exercise 10.6: Rewrite the function transliterate for UTF-8 characters.

Exercise 10.7: Write a function to reverse a UTF-8 string.

89
Chapter 11. Interlude: Most Frequent
Words
In this interlude we will develop a program that reads a text and prints the most frequent words in that text.
As in the previous interlude, the program here is quite simple, but it uses some more advanced features,
such as iterators and anonymous functions.

The main data structure of our program is a table that maps each word found in the text to its frequency
counter. With this data structure, the program has three main tasks:

• Read the text, counting the number of occurrences of each word.

• Sort the list of words in descending order of frequencies.

• Print the first n entries in the sorted list.

To read the text, we can iterate over all its lines and, for each line, we iterate over all its words. For each
word that we read, we increment its respective counter:

local counter = {}

for line in io.lines() do


for word in string.gmatch(line, "%w+") do
counter[word] = (counter[word] or 0) + 1
end
end

Here, we describe a “word” using the pattern '%w+', that is, one or more alphanumeric characters.

The next step is to sort the list of words. However, as the attentive reader may have noticed already, we
do not have a list of words to sort! Nevertheless, it is easy to create one, using the words that appear as
keys in table counter:

local words = {} -- list of all words found in the text

for w in pairs(counter) do
words[#words + 1] = w
end

Once we have the list, we can sort it using table.sort:

table.sort(words, function (w1, w2)


return counter[w1] > counter[w2] or
counter[w1] == counter[w2] and w1 < w2
end)

Remember that the order function must return true when w1 must come before w2 in the result. Words
with larger counters come first; words with equal counters come in alphabetical order.

Figure 11.1, “Word-frequency program” presents the complete program.

90
Interlude: Most Frequent Words

Figure 11.1. Word-frequency program


local counter = {}

for line in io.lines() do


for word in string.gmatch(line, "%w+") do
counter[word] = (counter[word] or 0) + 1
end
end

local words = {} -- list of all words found in the text

for w in pairs(counter) do
words[#words + 1] = w
end

table.sort(words, function (w1, w2)


return counter[w1] > counter[w2] or
counter[w1] == counter[w2] and w1 < w2
end)

-- number of words to print


local n = math.min(tonumber(arg[1]) or math.huge, #words)

for i = 1, n do
io.write(words[i], "\t", counter[words[i]], "\n")
end

The last loop prints the result, which is the first n words and their respective counters. The program assumes
that its first argument is the number of words to be printed; by default, it prints all words if no argument
is given.

As an example, we show the result of applying this program over this book:

$ lua wordcount.lua 10 < book.of


the 5996
a 3942
to 2560
is 1907
of 1898
in 1674
we 1496
function 1478
and 1424
x 1266

Exercises
Exercise 11.1: When we apply the word-frequency program to a text, usually the most frequent words
are uninteresting small words like articles and prepositions. Change the program so that it ignores words
with less than four letters.

Exercise 11.2: Repeat the previous exercise but, instead of using length as the criterion for ignoring a word,
the program should read from a text file a list of words to be ignored.

91
Chapter 12. Date and Time
The standard libraries offer few functions to manipulate date and time in Lua. As usual, all it offers is
what is available in the standard C libraries. Nevertheless, despite its apparent simplicity, we can make
quite a lot with this basic support.

Lua uses two representations for date and time. The first one is through a single number, usually an integer.
Although not required by ISO C, on most systems this number is the number of seconds since some fixed
date, called the epoch. In particular, both in POSIX and Windows systems the epoch is Jan 01, 1970, 0:00
UTC.

The second representation that Lua uses for dates and times is a table. Such date tables have the following
significant fields: year, month, day, hour, min, sec, wday, yday, and isdst. All fields except
isdst have integer values. The first six fields have obvious meanings. The wday field is the day of
the week (one is Sunday); the yday field is the day of the year (one is January 1st). The isdst field
is a Boolean, true if daylight saving is in effect. As an example, Sep 16, 1998, 23:48:10 (a Wednesday)
corresponds to the following table:

{year = 1998, month = 9, day = 16, yday = 259, wday = 4,


hour = 23, min = 48, sec = 10, isdst = false}

Date tables do not encode a time zone. It is up to the program to interpret them correctly with respect to
time zones.

The Function os.time


The function os.time, when called without arguments, returns the current date and time, coded as a
number:

> os.time() --> 1439653520

This date corresponds to Aug 15, 2015, 12:45:20.1 In a POSIX system, we can use some basic arithmetic
to decompose that number:

local date = 1439653520


local day2year = 365.242 -- days in a year
local sec2hour = 60 * 60 -- seconds in an hour
local sec2day = sec2hour * 24 -- seconds in a day
local sec2year = sec2day * day2year -- seconds in a year

-- year
print(date // sec2year + 1970) --> 2015.0

-- hour (in UTC)


print(date % sec2day // sec2hour) --> 15

-- minutes
print(date % sec2hour // 60) --> 45

-- seconds
print(date % 60) --> 20
1
Unless otherwise stated, my dates are from a POSIX system running in Rio de Janeiro.

92
Date and Time

We can also call os.time with a date table, to convert the table representation to a number. The year,
month, and day fields are mandatory. The hour, min, and sec fields default to noon (12:00:00) when
not provided. Other fields (including wday and yday) are ignored.

> os.time({year=2015, month=8, day=15, hour=12, min=45, sec=20})


--> 1439653520
> os.time({year=1970, month=1, day=1, hour=0}) --> 10800
> os.time({year=1970, month=1, day=1, hour=0, sec=1})
--> 10801
> os.time({year=1970, month=1, day=1}) --> 54000

Note that 10800 is three hours (the time zone) in seconds and 54000 is 10800 plus 12 hours in seconds.

The Function os.date


The function os.date, despite its name, is a kind of reverse of os.time: it converts a number repre-
senting the date and time to some higher-level representation, either a date table or a string. Its first para-
meter is a format string, describing the representation we want. The second parameter is the numeric date–
time; it defaults to the current date and time if not provided.

To produce a date table, we use the format string "*t". For instance, the call os.date("*t",
906000490) returns the following table:

{year = 1998, month = 9, day = 16, yday = 259, wday = 4,


hour = 23, min = 48, sec = 10, isdst = false}

In general, we have that os.time(os.date("*t", t)) == t, for any valid time t.

Except for isdst, the resulting fields are integers in the following ranges:

year a full year


month 1–12
day 1–31
hour 0–23
min 0–59
sec 0–60
wday 1–7
yday 1–366

(Seconds can go up to 60 to allow for leap seconds.)

For other format strings, os.date returns a copy of the string with specific directives replaced by infor-
mation about the given time and date. A directive consists of a percent sign followed by a letter, as in
the next example:

print(os.date("a %A in %B")) --> a Tuesday in May


print(os.date("%d/%m/%Y", 906000490)) --> 16/09/1998

When relevant, representations follow the current locale. For instance, in a locale for Brazil–Portuguese,
%A would result in "terça-feira" and %B in "maio".

Figure 12.1, “Directives for function os.date” shows the main directives. For each directive, it presents
its meaning and its value for September 16, 1998 (a Wednesday), at 23:48:10.

93
Date and Time

Figure 12.1. Directives for function os.date

%a abbreviated weekday name (e.g., Wed)


%A full weekday name (e.g., Wednesday)
%b abbreviated month name (e.g., Sep)
%B full month name (e.g., September)
%c date and time (e.g., 09/16/98 23:48:10)
%d day of the month (16) [01–31]
%H hour, using a 24-hour clock (23) [00–23]
%I hour, using a 12-hour clock (11) [01–12]
%j day of the year (259) [001–365]
%m month (09) [01–12]
%M minute (48) [00–59]
%p either "am" or "pm" (pm)
%S second (10) [00–60]
%w weekday (3) [0–6 = Sunday–Saturday]
%W week of the year (37) [00–53]
%x date (e.g., 09/16/98)
%X time (e.g., 23:48:10)
%y two-digit year (98) [00–99]
%Y full year (1998)
%z timezone (e.g., -0300)
%% a percent sign

For numerical values, the table shows also their range of possible values. Here are some examples, showing
how to create some ISO 8601 formats:

t = 906000490
-- ISO 8601 date
print(os.date("%Y-%m-%d", t)) --> 1998-09-16
-- ISO 8601 combined date and time
print(os.date("%Y-%m-%dT%H:%M:%S", t)) --> 1998-09-16T23:48:10
-- ISO 8601 ordinal date
print(os.date("%Y-%j", t)) --> 1998-259

If the format string starts with an exclamation mark, then os.date interprets the time in UTC:

-- the Epoch
print(os.date("!%c", 0)) --> Thu Jan 1 00:00:00 1970

If we call os.date without any arguments, it uses the %c format, that is, date and time information in
a reasonable format. Note that the representations for %x, %X, and %c change according to the locale and
the system. If you want a fixed representation, such as dd/mm/yyyy, use an explicit format string, such
as "%d/%m/%Y".

94
Date and Time

Date–Time Manipulation
When os.date creates a date table, its fields are all in the proper ranges. However, when we give a date
table to os.time, its fields do not need to be normalized. This feature is an important tool to manipulate
dates and times.

As a simple example, suppose we want to know the date 40 days from now. We can compute that date
as follows:

t = os.date("*t") -- get current time


print(os.date("%Y/%m/%d", os.time(t))) --> 2015/08/18
t.day = t.day + 40
print(os.date("%Y/%m/%d", os.time(t))) --> 2015/09/27

If we convert the numeric time back to a table, we get a normalized version of that date–time:

t = os.date("*t")
print(t.day, t.month) --> 26 2
t.day = t.day - 40
print(t.day, t.month) --> -14 2
t = os.date("*t", os.time(t))
print(t.day, t.month) --> 17 1

In this example, Feb -14 has been normalized to Jan 17, which is 40 days before Feb 26.

In most systems, we could also add or subtract 3456000 (40 days in seconds) to the numeric time. However,
the C standard does not guarantee the correctness of this operation, because it does not require numeric
times to denote seconds from some epoch. Moreover, if we want to add some months instead of days, the
direct manipulation of seconds becomes problematic, as different months have different durations. The
normalization method, on the other hand, has none of these problems:

t = os.date("*t") -- get current time


print(os.date("%Y/%m/%d", os.time(t))) --> 2015/08/18
t.month = t.month + 6 -- six months from now
print(os.date("%Y/%m/%d", os.time(t))) --> 2016/02/18

We have to be careful when manipulating dates. Normalization works in a somewhat obvious way, but it
may have some non-obvious consequences. For instance, if we compute one month after March 31, that
would give April 31, which is normalized to May 1 (one day after April 30). That sounds quite natural.
However, if we take one month back from that result (May 1), we arrive on April 1, not the original March
31. Note that this mismatch is a consequence of the way our calendar works; it has nothing to do with Lua.

To compute the difference between two times, there is the function os.difftime. It returns the differ-
ence, in seconds, between two given numeric times. For most systems, this difference is exactly the result
of subtracting on time from the other. Unlike the subtraction, however, the behavior of os.difftime
is guaranteed in any system. The next example computes the number of days passed between the release
of Lua 5.2 and Lua 5.3:

local t5_3 = os.time({year=2015, month=1, day=12})


local t5_2 = os.time({year=2011, month=12, day=16})
local d = os.difftime(t5_3, t5_2)
print(d // (24 * 3600)) --> 1123.0

With difftime, we can express dates as number of seconds since any arbitrary epoch:

> myepoch = os.time{year = 2000, month = 1, day = 1, hour = 0}

95
Date and Time

> now = os.time{year = 2015, month = 11, day = 20}


> os.difftime(now, myepoch) --> 501336000.0

Using normalization, it is easy to convert that number of seconds back to a legitimate numeric time: we
create a table with the epoch and set its seconds as the number we want to convert, as in the next example.

> T = {year = 2000, month = 1, day = 1, hour = 0}


> T.sec = 501336000
> os.date("%d/%m/%Y", os.time(T)) --> 20/11/2015

We can also use os.difftime to compute the running time of a piece of code. For this task, however,
it is better to use os.clock. The function os.clock returns the number of seconds of CPU time used
by the program. Its typical use is to benchmark a piece of code:

local x = os.clock()
local s = 0
for i = 1, 100000 do s = s + i end
print(string.format("elapsed time: %.2f\n", os.clock() - x))

Unlike os.time, os.clock usually has sub-second precision, so its result is a float. The exact precision
depends on the platform; in POSIX systems, it is typically one microsecond.

Exercises
Exercise 12.1: Write a function that returns the date–time exactly one month after a given date–time.
(Assume the numeric coding of date–time.)

Exercise 12.2: Write a function that returns the day of the week (coded as an integer, one is Sunday) of
a given date.

Exercise 12.3: Write a function that takes a date–time (coded as a number) and returns the number of
seconds passed since the beginning of its respective day.

Exercise 12.4: Write a function that takes a year and returns the day of its first Friday.

Exercise 12.5: Write a function that computes the number of complete days between two given dates.

Exercise 12.6: Write a function that computes the number of complete months between two given dates.

Exercise 12.7: Does adding one month and then one day to a given date give the same result as adding
one day and then one month?

Exercise 12.8: Write a function that produces the system's time zone.

96
Chapter 13. Bits and Bytes
Lua handles binary data similarly to text. A string in Lua can contain any bytes, and almost all library
functions that handle strings can handle arbitrary bytes. We can even do pattern matching on binary data.
On top of that, Lua 5.3 introduced extra facilities to manipulate binary data: besides integer numbers, it
brought bitwise operators and functions to pack and unpack binary data. In this chapter, we will cover
these and other facilities for handling binary data in Lua.

Bitwise Operators
Starting with version 5.3, Lua offers a standard set of bitwise operators on numbers. Unlike arithmetic
operations, bitwise operators only work on integer values. The bitwise operators are & (bitwise AND), |
(bitwise OR), ~ (bitwise exclusive-OR), >> (logical right shift), << (left shift), and the unary ~ (bitwise
NOT). (Note that, in several languages, the exclusive-OR operator is denoted by ^. In Lua, ^ means
exponentiation.)

> string.format("%x", 0xff & 0xabcd) --> cd


> string.format("%x", 0xff | 0xabcd) --> abff
> string.format("%x", 0xaaaa ~ -1) --> ffffffffffff5555
> string.format("%x", ~0) --> ffffffffffffffff

(Several examples in this chapter will use string.format to show results in hexadecimal.)

All bitwise operators work on all bits of integers. In Standard Lua, that means 64 bits. That can be a
problem when implementing algorithms that assume 32-bit integers (e.g., the cryptographic hash SHA-2).
However, it is not difficult to perform 32-bit integer manipulation. Except for the right-shift operation, all
bitwise operations on 64 bits agree with the same operations on 32 bits, if we simply ignore the higher
half bits. The same is true for addition, subtraction, and multiplication. So, all we have to do to operate
on 32-bit integers is to erase the higher 32 bits of an integer before a right shift. (We seldom do divisions
on that kind of computations.)

Both shift operators fill with zeros the vacant bits. This is usually called logical shifts. Lua does not offer
an arithmetic right shift, which fills vacant bits with the signal bit. We can perform the equivalent to
arithmetic shifts with a floor division by an appropriate power of two. (For instance, x // 16 is the
same as an arithmetic shift by four.)

Negative displacements shift in the other direction, that is, a >> n is the same as a << -n:

> string.format("%x", 0xff << 12) --> ff000


> string.format("%x", 0xff >> -12) --> ff000

If the displacement is equal to or larger than the number of bits in the integer representation (64 in Standard
Lua, 32 in Small Lua), the result is zero, as all bits are shifted out of the result:

> string.format("%x", -1 << 80) --> 0

Unsigned Integers
The representation of integers uses one bit to store the signal. Therefore, the maximum integer that we can
represent with 64-bit integers is 263 - 1, instead of 264 - 1. Usually, this difference is irrelevant, as 263 -
1 is quite large already. However, sometimes we cannot waste a bit for the signal, because we are either
handling external data with unsigned integers or implementing some algorithm that needs integers with all
their 64 bits. Moreover, in Small Lua the difference can be quite significant. For instance, if we use a 32-
bit signed integer as a position in a file, we are limited to 2 GB files; an unsigned integer doubles that limit.

97
Bits and Bytes

Lua does not offer explicit support for unsigned integers. Nevertheless, with some care, it is not difficult
to handle unsigned integers in Lua, as we will see now.

We can write constants larger than 263 - 1 directly, despite appearances:

> x = 13835058055282163712 -- 3 << 62


> x --> -4611686018427387904

The problem here is not the constant, but the way Lua prints it: the standard way to print numbers interprets
them as signed integers. We can use the %u or %x options in string.format to see integers as unsigned:

> string.format("%u", x) --> 13835058055282163712


> string.format("0x%X", x) --> 0xC000000000000000

Due to the way signed integers are represented (two's complement), the operations of addition, subtraction,
and multiplication work the same way for signed and unsigned integers:

> string.format("%u", x) --> 13835058055282163712


> string.format("%u", x + 1) --> 13835058055282163713
> string.format("%u", x - 1) --> 13835058055282163711

(With such a large value, multiplying x even by two would overflow, so we did not include that operation
in the example.)

Order operators work differently for signed and unsigned integers. The problem appears when we compare
two integers with a difference in the higher bit. For signed integers, the integer with that bit set is the
smaller, because it represents a negative number:

> 0x7fffffffffffffff < 0x8000000000000000 --> false

This result is clearly incorrect if we regard both integers as unsigned. So, we need a different operation to
compare unsigned integers. Lua 5.3 provides math.ult (unsigned less than) for that need:

> math.ult(0x7fffffffffffffff, 0x8000000000000000) --> true

Another way to do the comparison is to flip the signal bit of both operands before doing a signed com-
parison:

> mask = 0x8000000000000000


> (0x7fffffffffffffff ~ mask) < (0x8000000000000000 ~ mask)
--> true

Unsigned division is also different from its signed version. Figure 13.1, “Unsigned division” shows an
algorithm for unsigned division.

Figure 13.1. Unsigned division


function udiv (n, d)
if d < 0 then
if math.ult(n, d) then return 0
else return 1
end
end
local q = ((n >> 1) // d) << 1
local r = n - q * d
if not math.ult(r, d) then q = q + 1 end
return q
end

98
Bits and Bytes

The first test (d < 0) is equivalent to testing whether d is larger than 263. In that case, the quotient can
only be 1 (if n is equal to or larger than d) or 0. Otherwise, we do the equivalent of dividing the dividend
by two, then dividing the result by the divisor, and then multiplying the result by two. The right shift is
equivalent to an unsigned division by two; the result will be a non-negative signed integer. The subsequent
left shift corrects the quotient, undoing this previous division.

In general, floor(floor(n / 2) / d) * 2 (the computation done by the algorithm) is not equal


to floor(((n / 2) / d) * 2) (the correct result). However, it is not difficult to prove that the
difference is at most one. So, the algorithm computes the rest of the division (in the variable r) and checks
whether it is greater than the divisor: if so, it corrects the quotient (adding one to it) and it is done.

Converting an unsigned integer to/from a float needs some adjustments. To convert an unsigned integer
to a float, we can convert it as a signed integer and correct the result with a modulo operator:

> u = 11529215046068469760 -- an example


> f = (u + 0.0) % 2^64
> string.format("%.0f", f) --> 11529215046068469760

The value of u + 0.0 is -6917529027641081856, because the standard conversion sees u as a signed
integer. The modulo operation brings the value back to the range of unsigned integers. (In real code we do
not need the addition, because the modulo with a float would do the conversion anyway.)

To convert from a float to an unsigned integer, we can use the following code:

> f = 0xA000000000000000.0 -- an example


> u = math.tointeger(((f + 2^63) % 2^64) - 2^63)
> string.format("%x", u) --> a000000000000000

The addition transforms a value greater than 263 in a value greater than 264. The modulo operator then
projects this value to the range [0,263), and the subtraction makes it a “negative” value (that is, a value
with the highest bit set). For a value smaller than 263, the addition keeps it smaller than 264, the modulo
operator does not affect it, and the subtraction restores its original value.

Packing and Unpacking Binary Data


Lua 5.3 also introduced functions for converting between binary data and basic values (numbers and
strings). The function string.pack “packs” values into a binary string; string.unpack extracts
those values from the string.

Both string.pack and string.unpack get as their first argument a format string, which describes
how the data is packed. Each letter in this string describes how to pack/unpack one value; see the following
example:

> s = string.pack("iii", 3, -27, 450)


> #s --> 12
> string.unpack("iii", s) --> 3 -27 450 13

This call to string.pack creates a string with the binary codes of three integers (according to the
description "iii"), each encoding its corresponding argument. The string length will be three times the
native size of an integer (3 times 4 bytes in my machine). The call to string.unpack decodes three
integers (again according to "iii") from the given string and returns the decoded values.

The function string.unpack also returns the position in the string after the last item read, to simplify
iterations. (This explains the 13 in the results of the last example.) Accordingly, it accepts an optional
third argument, which tells where to start reading. For instance, the next example prints all strings packed
inside a given string:

99
Bits and Bytes

s = "hello\0Lua\0world\0"
local i = 1
while i <= #s do
local res
res, i = string.unpack("z", s, i)
print(res)
end
--> hello
--> Lua
--> world

As we will see, the z option means a zero-terminated string, so that the call to unpack extracts the string
at position i from s and returns that string plus the next position for the loop.

There are several options for coding an integer, each corresponding to a native integer size: b (char), h
(short), i (int), and l (long); the option j uses the size of a Lua integer. To use a fixed, machine-in-
dependent size, we can suffix the i option with a number from one to 16. For instance, i7 will produce
seven-byte integers. All sizes check for overflows:

> x = string.pack("i7", 1 << 54)


> string.unpack("i7", x) --> 18014398509481984 8
> x = string.pack("i7", -(1 << 54))
> string.unpack("i7", x) --> -18014398509481984 8
> x = string.pack("i7", 1 << 55)
stdin:1: bad argument #2 to 'pack' (integer overflow)

We can pack and unpack integers wider than native Lua integers but, when unpacking, their actual values
must fit into Lua integers:

> x = string.pack("i12", 2^61)


> string.unpack("i12", x) --> 2305843009213693952 13
> x = "aaaaaaaaaaaa" -- fake a large 12-byte number
> string.unpack("i12", x)
stdin:1: 12-byte integer does not fit into Lua Integer

Each integer option has an upper-case version corresponding to an unsigned integer of the same size:

> s = "\xFF"
> string.unpack("b", s) --> -1 2
> string.unpack("B", s) --> 255 2

Moreover, unsigned integers have an extra option T for size_t. (The size_t type in ISO C is an
unsigned integer larger enough to hold the size of any object.)

We can pack strings in three representations: zero-terminated strings, fixed-length strings, and strings with
explicit length. Zero-terminated strings use the z option. For fixed-length strings, we use the option cn,
where n is the number of bytes in the packed string. The last option for strings stores the string preceded
by its length. In that case, the option has the format sn, where n is the size of the unsigned integer used
to store the length. For instance, the option s1 stores the string length in one byte:

s = string.pack("s1", "hello")
for i = 1, #s do print((string.unpack("B", s, i))) end
--> 5 (length)
--> 104 ('h')
--> 101 ('e')
--> 108 ('l')
--> 108 ('l')

100
Bits and Bytes

--> 111 ('o')

Lua raises an error if the length does not fit into the given size. We can also use a pure s as the option;
in that case, the length is stored as a size_t, which is large enough to hold the length of any string.
(In 64-bit machines, size_t usually is an eight-byte unsigned integer, which may be a waste of space
for small strings.)

For floating-point numbers, there are three options: f for single precision, d for double precision, and n
for a Lua float.

The format string also has options to control the endianess and the alignment of the binary data. By default,
a format uses the machine's native endianess. The > option turns all subsequent encodings in that format
to big endian, or network byte order:

s = string.pack(">i4", 1000000)
for i = 1, #s do print((string.unpack("B", s, i))) end
--> 0
--> 15
--> 66
--> 64

The < option turns to little endian:

s = string.pack("<i2 i2", 500, 24)


for i = 1, #s do print((string.unpack("B", s, i))) end
--> 244
--> 1
--> 24
--> 0

Finally, the = option turns back to the default machine's native endianess.

For alignment, the !n option forces data to align at indices that are multiples of n. More specifically, if
the item is smaller than n, it is aligned at its own size; otherwise, it is aligned at n. For instance, suppose
we start the format string with !4. Then, one-byte integers will be written in indices multiple of one (that
is, any index), two-byte integers will be written in indices multiple of two, and four-byte or larger integers
will be written in indices multiple of four. The ! option (without a number) sets the alignment to the
machine's native alignment.

The function string.pack does the alignment by adding zeros to the resulting string until the index
has a proper value. The function string.unpack simply skips the padding when reading the string.
Alignment only works for powers of two: if we set the alignment to four and try to manipulate a three-
byte integer, Lua will raise an error.

Any format string works as if prefixed by "=!1", which means native endianess and no alignment (as
every index is a multiple of one). We can change the endianess and the alignment at any point during
the translation.

If needed, we can add padding manually. The option x means one byte of padding; string.pack adds
a zero byte to the resulting string; string.unpack skips one byte from the subject string.

Binary files
The functions io.input and io.output always open a file in text mode. In POSIX, there is no dif-
ference between binary files and text files. In some systems like Windows, however, we must open binary
files in a special way, using the letter b in the mode string of io.open.

101
Bits and Bytes

Typically, we read binary data either with the "a" pattern, that reads the whole file, or with the pattern
n, that reads n bytes. (Lines make no sense in a binary file.) As a simple example, the following program
converts a text file from Windows format to POSIX format —that is, it translates sequences of carriage
return–newlines to newlines:

local inp = assert(io.open(arg[1], "rb"))


local out = assert(io.open(arg[2], "wb"))

local data = inp:read("a")


data = string.gsub(data, "\r\n", "\n")
out:write(data)

assert(out:close())

It cannot use the standard I/O streams (stdin/stdout), because these streams are open in text mode.
Instead, it assumes that the names of the input file and the output file are arguments to the program. We
can call this program with the following command line:

> lua prog.lua file.dos file.unix

As another example, the following program prints all strings found in a binary file:

local f = assert(io.open(arg[1], "rb"))


local data = f:read("a")
local validchars = "[%g%s]"
local pattern = "(" .. string.rep(validchars, 6) .. "+)\0"
for w in string.gmatch(data, pattern) do
print(w)
end

The program assumes that a string is any zero-terminated sequence of six or more valid characters, where a
valid character is any character accepted by the pattern validchars. In our example, this pattern com-
prises the printable characters. We use string.rep and concatenation to create a pattern that matches
all sequences of six or more validchars ended by a zero. The parentheses in the pattern capture the
string without the zero.

Our last example is a program to make a dump of a binary file, showing its contents in hexadecimal.
Figure 13.2, “Dumping the dump program” shows the result of applying this program to itself on a POSIX
machine.

Figure 13.2. Dumping the dump program


6C 6F 63 61 6C 20 66 20 3D 20 61 73 73 65 72 74 local f = assert
28 69 6F 2E 6F 70 65 6E 28 61 72 67 5B 31 5D 2C (io.open(arg[1],
20 22 72 62 22 29 29 0A 6C 6F 63 61 6C 20 62 6C "rb")).local bl
6F 63 6B 73 69 7A 65 20 3D 20 31 36 0A 66 6F 72 ocksize = 16.for
20 62 79 74 65 73 20 69 6E 20 66 3A 6C 69 6E 65 bytes in f:line
...
25 63 22 2C 20 22 2E 22 29 0A 20 20 69 6F 2E 77 %c", "."). io.w
72 69 74 65 28 22 20 22 2C 20 62 79 74 65 73 2C rite(" ", bytes,
20 22 5C 6E 22 29 0A 65 6E 64 0A 0A "\n").end..

The complete program is here:

local f = assert(io.open(arg[1], "rb"))

102
Bits and Bytes

local blocksize = 16
for bytes in f:lines(blocksize) do
for i = 1, #bytes do
local b = string.unpack("B", bytes, i)
io.write(string.format("%02X ", b))
end
io.write(string.rep(" ", blocksize - #bytes))
bytes = string.gsub(bytes, "%c", ".")
io.write(" ", bytes, "\n")
end

Again, the first program argument is the input file name; the output is regular text, so it can go to the
standard output. The program reads the file in chunks of 16 bytes. For each chunk, it writes the hexadecimal
representation of each byte, and then it writes the chunk as text, changing control characters to dots. We use
string.rep to fill with blanks the last line (which in general will not have exactly 16 bytes), keeping
the alignment.

Exercises
Exercise 13.1: Write a function to compute the modulo operation for unsigned integers.

Exercise 13.2: Implement different ways to compute the number of bits in the representation of a Lua
integer.

Exercise 13.3: How can you test whether a given integer is a power of two?

Exercise 13.4: Write a function to compute the Hamming weight of a given integer. (The Hamming weight
of a number is the number of ones in its binary representation.)

Exercise 13.5: Write a function to test whether the binary representation of a given integer is a palindrome.

Exercise 13.6: Implement a bit array in Lua. It should support the following operations:

• newBitArray(n) (creates an array with n bits),

• setBit(a, n, v) (assigns the Boolean value v to bit n of array a),

• testBit(a, n) (returns a Boolean with the value of bit n).

Exercise 13.7: Suppose we have a binary file with a sequence of records, each one with the following
format:

struct Record {
int x;
char[3] code;
float value;
};

Write a program that reads that file and prints the sum of the value fields.

103
Chapter 14. Data Structures
Tables in Lua are not a data structure; they are the data structure. We can represent all structures that other
languages offer —arrays, records, lists, queues, sets— with tables in Lua. Moreover, Lua tables implement
all these structures efficiently.

In more traditional languages, such as C and Pascal, we implement most data structures with arrays and
lists (where lists = records + pointers). Although we can implement arrays and lists using Lua tables (and
sometimes we do this), tables are more powerful than arrays and lists; many algorithms are simplified to
the point of triviality with the use of tables. For instance, we seldom write a search in Lua, because tables
offer direct access to any type.

It takes a while to learn how to use tables efficiently. Here, we will see how to implement typical data
structures with tables and cover some examples of their uses. We will start with arrays and lists, not because
we need them for the other structures, but because most programmers are already familiar with them. (We
have already seen the basics of this material in Chapter 5, Tables, but I will repeat it here for completeness.)
Then we will continue with more advanced examples, such as sets, bags, and graphs.

Arrays
We implement arrays in Lua simply by indexing tables with integers. Therefore, arrays do not have a fixed
size, but grow as needed. Usually, when we initialize the array we define its size indirectly. For instance,
after the following code, any attempt to access a field outside the range 1–1000 will return nil, instead
of zero:

local a = {} -- new array


for i = 1, 1000 do
a[i] = 0
end

The length operator (#) uses this fact to find the size of an array:

print(#a) --> 1000

We can start an array at index zero, one, or any other value:

-- create an array with indices from -5 to 5


a = {}
for i = -5, 5 do
a[i] = 0
end

However, it is customary in Lua to start arrays with index one. The Lua libraries adhere to this convention;
so does the length operator. If our arrays do not start with one, we will not be able to use these facilities.

We can use a constructor to create and initialize arrays in a single expression:

squares = {1, 4, 9, 16, 25, 36, 49, 64, 81}

Such constructors can be as large as we need. In Lua, it is not uncommon data-description files with
constructors with a few million elements.

104
Data Structures

Matrices and Multi-Dimensional Arrays


There are two main ways to represent matrices in Lua. The first one is with a jagged array (an array of
arrays), that is, a table wherein each element is another table. For instance, we can create a matrix of zeros
with dimensions N by M with the following code:

local mt = {} -- create the matrix


for i = 1, N do
local row = {} -- create a new row
mt[i] = row
for j = 1, M do
row[j] = 0
end
end

Because tables are objects in Lua, we have to create each row explicitly to build a matrix. On the one
hand, this is certainly more verbose than simply declaring a matrix, as we do in C. On the other hand, it
gives us more flexibility. For instance, we can create a triangular matrix by changing the inner loop in the
previous example to for j=1,i do ... end. With this code, the triangular matrix uses only half
the memory of the original one.

The second way to represent a matrix is by composing the two indices into a single one. Typically, we
do this by multiplying the first index by a suitable constant and then adding the second index. With this
approach, the following code would create our matrix of zeros with dimensions N by M:

local mt = {} -- create the matrix


for i = 1, N do
local aux = (i - 1) * M
for j = 1, M do
mt[aux + j] = 0
end
end

Quite often, applications use a sparse matrix, a matrix wherein most elements are zero or nil. For instance,
we can represent a graph by its adjacency matrix, which has the value x in position (m,n) when there is a
connection with cost x between nodes m and n. When these nodes are not connected, the value in position
(m,n) is nil. To represent a graph with ten thousand nodes, where each node has about five neighbors,
we will need a matrix with a hundred million entries (a square matrix with 10000 columns and 10000
rows), but approximately only fifty thousand of them will not be nil (five non-nil columns for each row,
corresponding to the five neighbors of each node). Many books on data structures discuss at length how to
implement such sparse matrices without wasting 800 MB of memory, but we seldom need these techniques
when programming in Lua. Because we represent arrays with tables, they are naturally sparse. With our
first representation (tables of tables), we will need ten thousand tables, each one with about five elements,
with a grand total of fifty thousand entries. With the second representation, we will have a single table,
with fifty thousand entries in it. Whatever the representation, we need space only for the non-nil elements.

We cannot use the length operator over sparse matrices, because of the holes (nil values) between active
entries. This is not a big loss; even if we could use it, we probably would not. For most operations, it would
be quite inefficient to traverse all those empty entries. Instead, we can use pairs to traverse only the non-
nil elements. As an example, let us see how to do matrix multiplication with sparse matrices represented
by jagged arrays.

Suppose we want to multiply a matrix a[M,K] by a matrix b[K,N], producing the matrix c[M,N]. The
usual matrix-multiplication algorithm goes like this:

105
Data Structures

for i = 1, M do
for j = 1, N do
c[i][j] = 0
for k = 1, K do
c[i][j] = c[i][j] + a[i][k] * b[k][j]
end
end
end

The two outer loops traverse the entire resulting matrix, and for each element, the inner loop computes
its value.

For sparse matrices with jagged arrays, this inner loop is a problem. Because it traverses a column of b,
instead of a row, we cannot use something like pairs here: the loop has to visit each row looking whether
that row has an element in that column. Instead of visiting only a few non-zero elements, the loop visits
all zero elements, too. (Traversing a column can be an issue in other contexts, too, because of its loss of
spatial locality.)

The following algorithm is quite similar to the previous one, but it reverses the order of the two inner
loops. With this simple change, it avoids traversing columns:

-- assumes 'c' has zeros in all elements


for i = 1, M do
for k = 1, K do
for j = 1, N do
c[i][j] = c[i][j] + a[i][k] * b[k][j]
end
end
end

Now, the middle loop traverses the row a[i], and the inner loop traverses the row b[k]. Both can use
pairs, visiting only the non-zero elements. The initialization of the resulting matrix c is not an issue
here, because an empty sparse matrix is naturally filled with zeros.

Figure 14.1. Multiplication of sparse matrices


function mult (a, b)
local c = {} -- resulting matrix
for i = 1, #a do
local resultline = {} -- will be 'c[i]'
for k, va in pairs(a[i]) do -- 'va' is a[i][k]
for j, vb in pairs(b[k]) do -- 'vb' is b[k][j]
local res = (resultline[j] or 0) + va * vb
resultline[j] = (res ~= 0) and res or nil
end
end
c[i] = resultline
end
return c
end

Figure 14.1, “Multiplication of sparse matrices” shows the complete implementation of the above algo-
rithm, using pairs and taking care of sparse entries. This implementation visits only the non-nil elements,
and the result is naturally sparse. Moreover, the code deletes resulting entries that by chance evaluate to
zero.

106
Data Structures

Linked Lists
Because tables are dynamic entities, it is easy to implement linked lists in Lua. We represent each node
with a table (what else?); links are simply table fields that contain references to other tables. For instance,
let us implement a singly-linked list, where each node has two fields, value and next. A simple variable
is the list root:

list = nil

To insert an element at the beginning of the list, with a value v, we do this:

list = {next = list, value = v}

To traverse the list, we write this:

local l = list
while l do
visit l.value
l = l.next
end

We can also implement easily other kinds of lists, such as doubly-linked lists or circular lists. However, we
seldom need those structures in Lua, because usually there is a simpler way to represent our data without
using linked lists. For instance, we can represent a stack with an (unbounded) array.

Queues and Double-Ended Queues


A simple way to implement queues in Lua is with functions insert and remove from the table
library. As we saw in the section called “The Table Library”, these functions insert and remove elements
in any position of an array, moving other elements to accommodate the operation. However, these moves
can be expensive for large structures. A more efficient implementation uses two indices, one for the first
element and another for the last. With that representation, we can insert or remove an element at both ends
in constant time, as shown in Figure 14.2, “A double-ended queue”.

107
Data Structures

Figure 14.2. A double-ended queue


function listNew ()
return {first = 0, last = -1}
end

function pushFirst (list, value)


local first = list.first - 1
list.first = first
list[first] = value
end

function pushLast (list, value)


local last = list.last + 1
list.last = last
list[last] = value
end

function popFirst (list)


local first = list.first
if first > list.last then error("list is empty") end
local value = list[first]
list[first] = nil -- to allow garbage collection
list.first = first + 1
return value
end

function popLast (list)


local last = list.last
if list.first > last then error("list is empty") end
local value = list[last]
list[last] = nil -- to allow garbage collection
list.last = last - 1
return value
end

If we use this structure in a strict queue discipline, calling only pushLast and popFirst, both first
and last will increase continually. However, because we represent arrays in Lua with tables, we can
index them either from 1 to 20 or from 16777201 to 16777220. With 64-bit integers, such a queue can run
for thirty thousand years, doing ten million insertions per second, before it has problems with overflows.

Reverse Tables
As I said, before, we seldom do searches in Lua. Instead, we use what we call an index table or a reverse
table.

Suppose we have a table with the names of the days of the week:

days = {"Sunday", "Monday", "Tuesday", "Wednesday",


"Thursday", "Friday", "Saturday"}

Now we want to translate a name into its position in the week. We can search the table, looking for the
given name. A more efficient approach, however, is to build a reverse table, say revDays, which has the
names as indices and the numbers as values. This table would look like this:

108
Data Structures

revDays = {["Sunday"] = 1, ["Monday"] = 2,


["Tuesday"] = 3, ["Wednesday"] = 4,
["Thursday"] = 5, ["Friday"] = 6,
["Saturday"] = 7}

Then, all we have to do to find the order of a name is to index this reverse table:

x = "Tuesday"
print(revDays[x]) --> 3

Of course, we do not need to declare the reverse table manually. We can build it automatically from the
original one:

revDays = {}
for k,v in pairs(days) do
revDays[v] = k
end

The loop will do the assignment for each element of days, with the variable k getting the keys (1, 2, ...)
and v the values ("Sunday", "Monday", ...).

Sets and Bags


Suppose we want to list all identifiers used in a program source; for that, we need to filter the reserved
words out of our listing. Some C programmers could be tempted to represent the set of reserved words
as an array of strings and search this array to know whether a given word is in the set. To speed up the
search, they could even use a binary tree to represent the set.

In Lua, an efficient and simple way to represent such sets is to put the set elements as indices in a table.
Then, instead of searching the table for a given element, we just index the table and test whether the result
is nil. In our example, we could write the following code:

reserved = {
["while"] = true, ["if"] = true,
["else"] = true, ["do"] = true,
}

for w in string.gmatch(s, "[%a_][%w_]*") do


if not reserved[w] then
do something with 'w' -- 'w' is not a reserved word
end
end

(In the definition of reserved, we cannot write while = true, because while is not a valid name
in Lua. Instead, we use the notation ["while"] = true.)

We can have a clearer initialization using an auxiliary function to build the set:

function Set (list)


local set = {}
for _, l in ipairs(list) do set[l] = true end
return set
end

reserved = Set{"while", "end", "function", "local", }

109
Data Structures

We can also use another set to collect the identifiers:

local ids = {}
for w in string.gmatch(s, "[%a_][%w_]*") do
if not reserved[w] then
ids[w] = true
end
end

-- print each identifier once


for w in pairs(ids) do print(w) end

Bags, also called multisets, differ from regular sets in that each element can appear multiple times. An
easy representation for bags in Lua is similar to the previous representation for sets, but with a counter
associated with each key.1 To insert an element, we increment its counter:

function insert (bag, element)


bag[element] = (bag[element] or 0) + 1
end

To remove an element, we decrement its counter:

function remove (bag, element)


local count = bag[element]
bag[element] = (count and count > 1) and count - 1 or nil
end

We only keep the counter if it already exists and it is still greater than zero.

String Buffers
Suppose we are building a string piecemeal, for instance reading a file line by line. Our typical code
could look like this:

local buff = ""


for line in io.lines() do
buff = buff .. line .. "\n"
end

Despite its innocent look, this code in Lua can cause a huge performance penalty for large files: for instance,
it takes more than 30 seconds to read a 4.5 MB file on my new machine.

Why is that? To understand what happens, let us imagine that we are in the middle of the read loop; each
line has 20 bytes and we have already read some 2500 lines, so buff is a string with 50 kB. When Lua
concatenates buff..line.."\n", it allocates a new string with 50020 bytes and copies the 50000
bytes from buff into this new string. That is, for each new line, Lua moves around 50 kB of memory, and
growing. The algorithm is quadratic. After reading 100 new lines (only 2 kB), Lua has already moved more
than 5 MB of memory. When Lua finishes reading 350 kB, it has moved around more than 50 GB. (This
problem is not peculiar to Lua: other languages wherein strings are immutable values present a similar
behavior, Java being a famous example.)

Before we continue, we should remark that, despite all I said, this situation is not a common problem. For
small strings, the above loop is fine. To read an entire file, Lua provides the io.read("a") option,
1
We already used this representation for the most-frequent-words program in Chapter 11, Interlude: Most Frequent Words.

110
Data Structures

which reads the file at once. However, sometimes we must face this problem. Java offers the String-
Buffer class to ameliorate the problem. In Lua, we can use a table as the string buffer. The key to this
approach is the function table.concat, which returns the concatenation of all the strings of a given
list. Using concat, we can write our previous loop as follows:

local t = {}
for line in io.lines() do
t[#t + 1] = line .. "\n"
end
local s = table.concat(t)

This algorithm takes less than 0.05 seconds to read the same file that took more than half a minute to read
with the original code. (Nevertheless, for reading a whole file it is still better to use io.read with the
"a" option.)

We can do even better. The function concat accepts an optional second argument, which is a separator
to be inserted between the strings. Using this separator, we do not need to insert a newline after each line:

local t = {}
for line in io.lines() do
t[#t + 1] = line
end
s = table.concat(t, "\n") .. "\n"

The function inserts the separator between the strings, but we still have to add the last newline. This last
concatenation creates a new copy of the resulting string, which can be quite long. There is no option to
make concat insert this extra separator, but we can deceive it, inserting an extra empty string in t:

t[#t + 1] = ""
s = table.concat(t, "\n")

Now, the extra newline that concat adds before this empty string is at the end of the resulting string,
as we wanted.

Graphs
Like any decent language, Lua allows multiple implementations for graphs, each one better adapted to
some particular algorithms. Here we will see a simple object-oriented implementation, where we represent
nodes as objects (actually tables, of course) and arcs as references between nodes.

We will represent each node as a table with two fields: name, with the node's name; and adj, with the
set of nodes adjacent to this one. Because we will read the graph from a text file, we need a way to find
a node given its name. So, we will use an extra table mapping names to nodes. Given a name, function
name2node returns the corresponding node:

local function name2node (graph, name)


local node = graph[name]
if not node then
-- node does not exist; create a new one
node = {name = name, adj = {}}
graph[name] = node
end
return node
end

111
Data Structures

Figure 14.3, “Reading a graph from a file” shows the function that builds a graph.

Figure 14.3. Reading a graph from a file


function readgraph ()
local graph = {}
for line in io.lines() do
-- split line in two names
local namefrom, nameto = string.match(line, "(%S+)%s+(%S+)")
-- find corresponding nodes
local from = name2node(graph, namefrom)
local to = name2node(graph, nameto)
-- adds 'to' to the adjacent set of 'from'
from.adj[to] = true
end
return graph
end

It reads a file where each line has two node names, meaning that there is an arc from the first node to the
second. For each line, the function uses string.match to split the line in two names, finds the nodes
corresponding to these names (creating the nodes if needed), and connects the nodes.

Figure 14.4, “Finding a path between two nodes” illustrates an algorithm using such graphs.

Figure 14.4. Finding a path between two nodes


function findpath (curr, to, path, visited)
path = path or {}
visited = visited or {}
if visited[curr] then -- node already visited?
return nil -- no path here
end
visited[curr] = true -- mark node as visited
path[#path + 1] = curr -- add it to path
if curr == to then -- final node?
return path
end
-- try all adjacent nodes
for node in pairs(curr.adj) do
local p = findpath(node, to, path, visited)
if p then return p end
end
table.remove(path) -- remove node from path
end

The function findpath searches for a path between two nodes using a depth-first traversal. Its first
parameter is the current node; the second is its goal; the third parameter keeps the path from the origin to
the current node; the last parameter is a set with all the nodes already visited, to avoid loops. Note how
the algorithm manipulates nodes directly, without using their names. For instance, visited is a set of
nodes, not of node names. Similarly, path is a list of nodes.

To test this code, we add a function to print a path and some code to put it all to work:

function printpath (path)


for i = 1, #path do

112
Data Structures

print(path[i].name)
end
end

g = readgraph()
a = name2node(g, "a")
b = name2node(g, "b")
p = findpath(a, b)
if p then printpath(p) end

Exercises
Exercise 14.1: Write a function to add two sparse matrices.

Exercise 14.2: Modify the queue implementation in Figure 14.2, “A double-ended queue” so that both
indices return to zero when the queue is empty.

Exercise 14.3: Modify the graph structure so that it can keep a label for each arc. The structure should
represent each arc by an object, too, with two fields: its label and the node it points to. Instead of an adjacent
set, each node keeps an incident set that contains the arcs that originate at that node.

Adapt the function readgraph to read two node names plus a label from each line in the input file.
(Assume that the label is a number.)

Exercise 14.4: Assume the graph representation of the previous exercise, where the label of each arc
represents the distance between its end nodes. Write a function to find the shortest path between two given
nodes, using Dijkstra's algorithm.

113
Chapter 15. Data Files and Serialization
When dealing with data files, it is usually much easier to write the data than to read it back. When we write
a file, we have full control of what is going on. When we read a file, on the other hand, we do not know what
to expect. Besides all the kinds of data that a correct file can contain, a robust program should also handle
bad files gracefully. Therefore, coding robust input routines is always difficult. In this chapter, we will see
how we can use Lua to eliminate all code for reading data from our programs, simply by writing the data in
an appropriate format. More specifically, we write data as Lua programs that, when run, rebuild the data.

Data description has been one of the main applications of Lua since its creation in 1993. At that time,
the main alternative for a textual data-description language would be SGML. For many people (including
us), SGML is bloated and complex. In 1998, some people simplified it to create XML, which in our view
is still bloated and complex. Other people shared our view, and some of them created JSON (in 2001).
JSON is based on Javascript and quite similar to restricted Lua data files. On the one hand, JSON has a
big advantage of being an international standard, and several languages (including Lua) have libraries to
manipulate JSON files. On the other hand, Lua files are trivial to read and more flexible.

Using a full programming language for data description is surely flexible, but it brings two problems. One
is security, as “data” files can run amok inside our program. We can solve that by running the file in a
sandbox, which we will discuss in the section called “Sandboxing”.

The other problem is performance. Lua not only runs fast, but it also compiles fast. For instance, in my
new machine, Lua 5.3 reads, compiles, and runs a program with ten million assignments in four seconds,
using 240 MB. For comparison, Perl 5.18 takes 21 seconds and 6 GB, Python 2.7 and Python 3.4 trash the
machine, Node.js 0.10.25 gives an “out of memory” error after eight seconds, and Rhino 1.7 also gives
an “out of memory” error, after six minutes.

Data Files
Table constructors provide an interesting alternative for file formats. With a little extra work when writing
data, reading becomes trivial. The technique is to write our data file as Lua code that, when run, rebuilds
the data into the program. With table constructors, these chunks can look remarkably like a plain data file.

Let us see an example to make things clear. If our data file is in a predefined format, such as CSV (Com-
ma-Separated Values) or XML, we have little choice. However, if we are going to create the file for our
own use, we can use Lua constructors as our format. In this format, we represent each data record as a Lua
constructor. Instead of writing in our data file something like

Donald E. Knuth,Literate Programming,CSLI,1992


Jon Bentley,More Programming Pearls,Addison-Wesley,1990

we write this:

Entry{"Donald E. Knuth",
"Literate Programming",
"CSLI",
1992}

Entry{"Jon Bentley",
"More Programming Pearls",
"Addison-Wesley",
1990}

Remember that Entry{code} is the same as Entry({code}), that is, a call to some function Entry
with a table as its single argument. So, that previous piece of data is a Lua program. To read that file, we

114
Data Files and Serialization

only need to run it, with a sensible definition for Entry. For instance, the following program counts the
number of entries in a data file:

local count = 0
function Entry () count = count + 1 end
dofile("data")
print("number of entries: " .. count)

The next program collects in a set the names of all authors found in the file, and then prints them:

local authors = {} -- a set to collect authors


function Entry (b) authors[b[1]] = true end
dofile("data")
for name in pairs(authors) do print(name) end

Note the event-driven approach in these program fragments: the function Entry acts as a callback func-
tion, which is called during the dofile for each entry in the data file.

When file size is not a big concern, we can use name-value pairs for our representation:1

Entry{
author = "Donald E. Knuth",
title = "Literate Programming",
publisher = "CSLI",
year = 1992
}

Entry{
author = "Jon Bentley",
title = "More Programming Pearls",
year = 1990,
publisher = "Addison-Wesley",
}

This format is what we call a self-describing data format, because each piece of data has attached to it a
short description of its meaning. Self-describing data are more readable (by humans, at least) than CSV
or other compact notations; they are easy to edit by hand, when necessary; and they allow us to make
small modifications in the basic format without having to change the data file. For instance, if we add a
new field we need only a small change in the reading program, so that it supplies a default value when
the field is absent.

With the name-value format, our program to collect authors becomes this:

local authors = {} -- a set to collect authors


function Entry (b) authors[b.author] = true end
dofile("data")
for name in pairs(authors) do print(name) end

Now the order of fields is irrelevant. Even if some entries do not have an author, we have to adapt only
the function Entry:

function Entry (b)


authors[b.author or "unknown"] = true
end
1
If this format reminds you of BibTeX, it is not a coincidence. BibTeX was one of the inspirations for the constructor syntax in Lua.

115
Data Files and Serialization

Serialization
Frequently we need to serialize some data, that is, to convert the data into a stream of bytes or characters,
so that we can save it into a file or send it through a network connection. We can represent serialized data
as Lua code in such a way that, when we run the code, it reconstructs the saved values into the reading
program.

Usually, if we want to restore the value of a global variable, our chunk will be something like varname
= exp, where exp is the Lua code to create the value. The varname is the easy part, so let us see how
to write the code that creates a value. For a numeric value, the task is easy:

function serialize (o)


if type(o) == "number" then
io.write(tostring(o))
else other cases
end
end

By writing a float in decimal format, however, we risk losing some precision. We can use a hexadecimal
format to avoid this problem. With format ("%a"), the read float will have exactly the same bits of the
original one. Moreover, since Lua 5.3 we should distinguish between integers and floats, so that they can
be restored with the correct subtype:

local fmt = {integer = "%d", float = "%a"}

function serialize (o)


if type(o) == "number" then
io.write(string.format(fmt[math.type(o)], o))
else other cases

For a string value, a naive approach would be something like this:

if type(o) == "string" then


io.write("'", o, "'")

However, if the string contains special characters (such as quotes or newlines) the resulting code will not
be a valid Lua program.

You may be tempted to solve this problem changing quotes:

if type(o) == "string" then


io.write("[[", o, "]]")

Beware of code injection! If a malicious user manages to direct your program to save something like
" ]]..os.execute('rm *')..[[ " (for instance, she can supply this string as her address), your
final chunk will be like this one:

varname = [[ ]]..os.execute('rm *')..[[ ]]

You will have a bad surprise trying to load this “data”.

A simple way to quote a string in a secure way is with the option "%q" from string.format. This
option was designed to save the string in a way that it can be safely read back by Lua. It surrounds the
string with double quotes and properly escapes double quotes, newlines, and some other characters inside
the string:

a = 'a "problematic" \\string'

116
Data Files and Serialization

print(string.format("%q", a)) --> "a \"problematic\" \\string"

Using this feature, our serialize function now looks like this:

function serialize (o)


if type(o) == "number" then
io.write(string.format(fmt[math.type(o)], o))
elseif type(o) == "string" then
io.write(string.format("%q", o))
else other cases
end
end

Lua 5.3.3 extended the format option "%q" to work also with numbers (plus nil and Booleans), again
writing them in a proper way to be read back by Lua. (In particular, it formats floats in hexadecimal, to
ensure full precision.) Thus, since that version, we can simplify and extend serialize even more:

function serialize (o)


local t = type(o)
if t == "number" or t == "string" or t == "boolean" or
t == "nil" then
io.write(string.format("%q", o))
else other cases
end
end

Another way to save strings is the notation [=[...]=] for long strings. However, this notation is mainly
intended for hand-written code, where we do not want to change a literal string in any way. In automatically
generated code, it is easier to escape problematic characters, as the option "%q" from string.format
does.

If you nevertheless want to use the long-string notation for automatically generated code, you must take
care of some details. The first one is that you must choose a proper number of equals signs. A good proper
number is one more than the maximum that appears in the original string. Because strings containing long
sequences of equals signs are common (e.g., comments delimiting parts of a source code), we should limit
our attention to sequences of equals signs enclosed by square bracket. The second detail is that Lua always
ignores a newline at the beginning of a long string; a simple way to avoid this problem is to add always
a newline to be ignored.

The function quote in Figure 15.1, “Quoting arbitrary literal strings” is the result of our previous remarks.

Figure 15.1. Quoting arbitrary literal strings


function quote (s)
-- find maximum length of sequences of equals signs
local n = -1
for w in string.gmatch(s, "]=*%f[%]]") do
n = math.max(n, #w - 1) -- -1 to remove the ']'
end

-- produce a string with 'n' plus one equals signs


local eq = string.rep("=", n + 1)

-- build quoted string


return string.format(" [%s[\n%s]%s] ", eq, s, eq)
end

117
Data Files and Serialization

It takes an arbitrary string and returns it formatted as a long string. The call to gmatch creates an iterator
to traverse all occurrences of the pattern ']=*%f[%]]' (that is, a closing square bracket followed by a
sequence of zero or more equals signs followed by a frontier with a closing square bracket) in the string s.
For each occurrence, the loop updates n with the maximum number of equals signs so far. After the loop,
we use string.rep to replicate an equals sign n + 1 times, which is one more than the maximum
occurring in the string. Finally, string.format encloses s with pairs of brackets with the correct
number of equals signs in between and adds extra spaces around the quoted string plus a newline at the
beginning of the enclosed string.

(We might be tempted to use the simpler pattern ']=*]', which does not use a frontier pattern for the
second square bracket, but there is a subtlety here. Suppose the subject is "]=]==]". The first match is
"]=]". After it, what is left in the string is "==]", and so there is no other match; in the end of the loop,
n would be one instead of two. The frontier pattern does not consume the bracket, so that it remains in
the subject for the following matches.)

Saving tables without cycles


Our next (and harder) task is to save tables. There are several ways to save them, according to what
assumptions we make about the table structure. No single algorithm seems appropriate for all cases. Simple
tables not only can use simpler algorithms, but also the output can be shorter and clearer.

Our first attempt is in Figure 15.2, “Serializing tables without cycles”.

Figure 15.2. Serializing tables without cycles


function serialize (o)
local t = type(o)
if t == "number" or t == "string" or t == "boolean" or
t == "nil" then
io.write(string.format("%q", o))
elseif t == "table" then
io.write("{\n")
for k,v in pairs(o) do
io.write(" ", k, " = ")
serialize(v)
io.write(",\n")
end
io.write("}\n")
else
error("cannot serialize a " .. type(o))
end
end

Despite its simplicity, that function does a reasonable job. It even handles nested tables (that is, tables
within other tables), as long as the table structure is a tree —that is, there are no shared subtables and no
cycles. (A small aesthetic improvement would be to indent nested tables; see Exercise 15.1.)

The previous function assumes that all keys in a table are valid identifiers. If a table has numeric keys,
or string keys that are not syntactic valid Lua identifiers, we are in trouble. A simple way to solve this
difficulty is to use the following code to write each key:

io.write(string.format(" [%s] = ", serialize(k)))

With this change, we improve the robustness of our function, at the cost of the aesthetics of the resulting
file. Consider the next call:

118
Data Files and Serialization

serialize{a=12, b='Lua', key='another "one"'}

The result of this call using the first version of serialize is this:

{
a = 12,
b = "Lua",
key = "another \"one\"",
}

Compare it with the result using the second version:

{
["a"] = 12,
["b"] = "Lua",
["key"] = "another \"one\"",
}

We can have both robustness and aesthetics by testing for each case whether it needs the square brackets;
again, we will leave this improvement as an exercise.

Saving tables with cycles


To handle tables with generic topology (i.e., with cycles and shared subtables) we need a different ap-
proach. Constructors cannot create such tables, so we will not use them. To represent cycles we need
names, so our next function will get as arguments the value to be saved plus its name. Moreover, we must
keep track of the names of the tables already saved, to reuse them when we detect a cycle. We will use
an extra table for this tracking. This table will have previously saved tables as indices and their names
as the associated values.

The resulting code is in Figure 15.3, “Saving tables with cycles”.

119
Data Files and Serialization

Figure 15.3. Saving tables with cycles


function basicSerialize (o)
-- assume 'o' is a number or a string
return string.format("%q", o)
end

function save (name, value, saved)


saved = saved or {} -- initial value
io.write(name, " = ")
if type(value) == "number" or type(value) == "string" then
io.write(basicSerialize(value), "\n")
elseif type(value) == "table" then
if saved[value] then -- value already saved?
io.write(saved[value], "\n") -- use its previous name
else
saved[value] = name -- save name for next time
io.write("{}\n") -- create a new table
for k,v in pairs(value) do -- save its fields
k = basicSerialize(k)
local fname = string.format("%s[%s]", name, k)
save(fname, v, saved)
end
end
else
error("cannot save a " .. type(value))
end
end

We keep the restriction that the tables we want to save have only strings and numbers as keys. The function
basicSerialize serializes these basic types, returning the result. The next function, save, does the
hard work. The saved parameter is the table that keeps track of tables already saved. As an example,
suppose we build a table like this:

a = {x=1, y=2; {3,4,5}}


a[2] = a -- cycle
a.z = a[1] -- shared subtable

The call save("a", a) will save it as follows:

a = {}
a[1] = {}
a[1][1] = 3
a[1][2] = 4
a[1][3] = 5

a[2] = a
a["y"] = 2
a["x"] = 1
a["z"] = a[1]

The actual order of these assignments may vary, as it depends on a table traversal. Nevertheless, the algo-
rithm ensures that any node needed in a new definition is already defined.

If we want to save several values with shared parts, we can make the calls to save them using the same
saved table. For instance, assume the following two tables:

120
Data Files and Serialization

a = {{"one", "two"}, 3}
b = {k = a[1]}

If we save them independently, the result will not have common parts. However, if we use the same saved
table for both calls to save, then the result will share common parts:

local t = {}
save("a", a, t)
save("b", b, t)

--> a = {}
--> a[1] = {}
--> a[1][1] = "one"
--> a[1][2] = "two"
--> a[2] = 3
--> b = {}
--> b["k"] = a[1]

As is usual in Lua, there are several other alternatives. Among them, we can save a value without giving
it a global name (instead, the chunk builds a local value and returns it), we can use the list syntax when
possible (see the exercises for this chapter), and so on. Lua gives you the tools; you build the mechanisms.

Exercises
Exercise 15.1: Modify the code in Figure 15.2, “Serializing tables without cycles” so that it indents nested
tables. (Hint: add an extra parameter to serialize with the indentation string.)

Exercise 15.2: Modify the code of the previous exercise so that it uses the syntax ["key"]=value, as
suggested in the section called “Saving tables without cycles”.

Exercise 15.3: Modify the code of the previous exercise so that it uses the syntax ["key"]=value only
when necessary (that is, when the key is a string but not a valid identifier).

Exercise 15.4: Modify the code of the previous exercise so that it uses the constructor syntax for lists
whenever possible. For instance, it should serialize the table {14, 15, 19} as {14, 15, 19}, not
as {[1] = 14, [2] = 15, [3] = 19}. (Hint: start by saving the values of the keys 1, 2, ..., as
long as they are not nil. Take care not to save them again when traversing the rest of the table.)

Exercise 15.5: The approach of avoiding constructors when saving tables with cycles is too radical. It is
possible to save the table in a more pleasant format using constructors for the simple case, and to use
assignments later only to fix sharing and loops. Reimplement the function save (Figure 15.3, “Saving
tables with cycles”) using this approach. Add to it all the goodies that you have implemented in the previous
exercises (indentation, record syntax, and list syntax).

121
Chapter 16. Compilation, Execution,
and Errors
Although we refer to Lua as an interpreted language, Lua always precompiles source code to an interme-
diate form before running it. (This is not a big deal: many interpreted languages do the same.) The presence
of a compilation phase may sound out of place in an interpreted language. However, the distinguishing
feature of interpreted languages is not that they are not compiled, but that it is possible (and easy) to exe-
cute code generated on the fly. We may say that the presence of a function like dofile is what entitles
us to call Lua an interpreted language.

In this chapter, we will discuss in more details the process that Lua uses for running its chunks, what
compilation means (and does), how Lua runs that compiled code, and how it handles errors in that process.

Compilation
Previously, we introduced dofile as a kind of primitive operation to run chunks of Lua code, but
dofile is actually an auxiliary function: the function loadfile does the hard work. Like dofile,
loadfile loads a Lua chunk from a file, but it does not run the chunk. Instead, it only compiles the
chunk and returns the compiled chunk as a function. Moreover, unlike dofile, loadfile does not
raise errors, but instead returns error codes. We could define dofile as follows:

function dofile (filename)


local f = assert(loadfile(filename))
return f()
end

Note the use of assert to raise an error if loadfile fails.

For simple tasks, dofile is handy, because it does the complete job in one call. However, loadfile is
more flexible. In case of error, loadfile returns nil plus the error message, which allows us to handle
the error in customized ways. Moreover, if we need to run a file several times, we can call loadfile
once and call its result several times. This approach is much cheaper than several calls to dofile, because
it compiles the file only once. (Compilation is a somewhat expensive operation when compared to other
tasks in the language.)

The function load is similar to loadfile, except that it reads its chunk from a string or from a function,
not from a file.1 For instance, consider the next line:

f = load("i = i + 1")

After this code, f will be a function that executes i = i + 1 when invoked:

i = 0
f(); print(i) --> 1
f(); print(i) --> 2

The function load is powerful; we should use it with care. It is also an expensive function (when compared
to some alternatives) and can result in incomprehensible code. Before you use it, make sure that there is
no simpler way to solve the problem at hand.
1
In Lua 5.1, function loadstring did the role of load for strings.

122
Compilation, Execution, and Errors

If we want to do a quick-and-dirty dostring (i.e., to load and run a chunk), we can call the result from
load directly:

load(s)()

However, if there is any syntax error, load will return nil and the final error message will be something
like “attempt to call a nil value”. For clearer error messages, it is better to use assert:

assert(load(s))()

Usually, it does not make sense to use load on a literal string. For instance, the next two lines are roughly
equivalent:

f = load("i = i + 1")

f = function () i = i + 1 end

However, the second line is much faster, because Lua compiles the function together with its enclosing
chunk. In the first line, the call to load involves a separate compilation.

Because load does not compile with lexical scoping, the two lines in the previous example may not be
truly equivalent. To see the difference, let us change the example a little:

i = 32
local i = 0
f = load("i = i + 1; print(i)")
g = function () i = i + 1; print(i) end
f() --> 33
g() --> 1

The function g manipulates the local i, as expected, but f manipulates a global i, because load always
compiles its chunks in the global environment.

The most typical use of load is to run external code (that is, pieces of code that come from outside our
program) or dynamically-generated code. For instance, we may want to plot a function defined by the user;
the user enters the function code and then we use load to evaluate it. Note that load expects a chunk,
that is, statements. If we want to evaluate an expression, we can prefix the expression with return, so that
we get a statement that returns the value of the given expression. See the example:

print "enter your expression:"


local line = io.read()
local func = assert(load("return " .. line))
print("the value of your expression is " .. func())

Because the function returned by load is a regular function, we can call it several times:

print "enter function to be plotted (with variable 'x'):"


local line = io.read()
local f = assert(load("return " .. line))
for i = 1, 20 do
x = i -- global 'x' (to be visible from the chunk)
print(string.rep("*", f()))
end

We can call load also with a reader function as its first argument. A reader function can return the
chunk in parts; load calls the reader successively until it returns nil, which signals the chunk's end. As
an example, the next call is equivalent to loadfile:

123
Compilation, Execution, and Errors

f = load(io.lines(filename, "*L"))

As we saw in Chapter 7, The External World, the call io.lines(filename, "*L") returns a function
that, at each call, returns a new line from the given file. So, load will read the chunk from the file line
by line. The following version is similar, but slightly more efficient:

f = load(io.lines(filename, 1024))

Here, the iterator returned by io.lines reads the file in blocks of 1024 bytes.

Lua treats any independent chunk as the body of an anonymous variadic function. For instance, load("a
= 1") returns the equivalent of the following expression:

function (...) a = 1 end

Like any other function, chunks can declare local variables:

f = load("local a = 10; print(a + 20)")


f() --> 30

Using these features, we can rewrite our plot example to avoid the use of a global variable x:

print "enter function to be plotted (with variable 'x'):"


local line = io.read()
local f = assert(load("local x = ...; return " .. line))
for i = 1, 20 do
print(string.rep("*", f(i)))
end

In this code, we append the declaration "local x = ..." at the beginning of the chunk to declare x as
a local variable. We then call f with an argument i that becomes the value of the vararg expression (...).

The functions load and loadfile never raise errors. In case of any kind of error, they return nil plus
an error message:

print(load("i i"))
--> nil [string "i i"]:1: '=' expected near 'i'

Moreover, these functions never have any kind of side effect, that is, they do not change or create variables,
do not write to files, etc. They only compile the chunk to an internal representation and return the result as
an anonymous function. A common mistake is to assume that loading a chunk defines functions. In Lua,
function definitions are assignments; as such, they happen at runtime, not at compile time. For instance,
suppose we have a file foo.lua like this:

-- file 'foo.lua'
function foo (x)
print(x)
end

We then run the command

f = loadfile("foo.lua")

This command compiles foo but does not define it. To define it, we must run the chunk:

f = loadfile("foo.lua")
print(foo) --> nil
f() -- run the chunk
foo("ok") --> ok

124
Compilation, Execution, and Errors

This behavior may sound strange, but it becomes clear if we rewrite the file without the syntax sugar:

-- file 'foo.lua'
foo = function (x)
print(x)
end

In a production-quality program that needs to run external code, we should handle any errors reported
when loading a chunk. Moreover, we may want to run the new chunk in a protected environment, to avoid
unpleasant side effects. We will discuss environments in detail in Chapter 22, The Environment.

Precompiled Code
As I mentioned in the beginning of this chapter, Lua precompiles source code before running it. Lua also
allows us to distribute code in precompiled form.

The simplest way to produce a precompiled file —also called a binary chunk in Lua jargon— is with
the luac program that comes in the standard distribution. For instance, the next call creates a new file
prog.lc with a precompiled version of a file prog.lua:

$ luac -o prog.lc prog.lua

The Lua interpreter can execute this new file just like normal Lua code, performing exactly as it would
with the original source:

$ lua prog.lc

Lua accepts precompiled code mostly anywhere it accepts source code. In particular, both loadfile
and load accept precompiled code.

We can write a minimal luac directly in Lua:

p = loadfile(arg[1])
f = io.open(arg[2], "wb")
f:write(string.dump(p))
f:close()

The key function here is string.dump: it receives a Lua function and returns its precompiled code as
a string, properly formatted to be loaded back by Lua.

The luac program offers some other interesting options. In particular, option -l lists the opcodes that
the compiler generates for a given chunk. As an example, Figure 16.1, “Example of output from luac -
l” shows the output of luac with option -l on the following one-line file:

a = x + y - z

Figure 16.1. Example of output from luac -l


main <stdin:0,0> (7 instructions, 28 bytes at 0x988cb30)
0+ params, 2 slots, 0 upvalues, 0 locals, 4 constants, 0 functions
1 [1] GETGLOBAL 0 -2 ; x
2 [1] GETGLOBAL 1 -3 ; y
3 [1] ADD 0 0 1
4 [1] GETGLOBAL 1 -4 ; z
5 [1] SUB 0 0 1
6 [1] SETGLOBAL 0 -1 ; a
7 [1] RETURN 0 1

125
Compilation, Execution, and Errors

(We will not discuss the internals of Lua in this book; if you are interested in more details about those
opcodes, a Web search for "lua opcode" should give you relevant material.)

Code in precompiled form is not always smaller than the original, but it loads faster. Another benefit is
that it gives a protection against accidental changes in sources. Unlike source code, however, maliciously
corrupted binary code can crash the Lua interpreter or even execute user-provided machine code. When
running usual code, there is nothing to worry about. However, you should avoid running untrusted code
in precompiled form. The function load has an option exactly for this task.

Besides its required first argument, load has three more arguments, all of them optional. The second is a
name for the chunk, used only in error messages. The fourth argument is an environment, which we will
discuss in Chapter 22, The Environment. The third argument is the one we are interested here; it controls
what kinds of chunks can be loaded. If present, this argument must be a string: the string "t" allows
only textual (normal) chunks; "b" allows only binary (precompiled) chunks; "bt", the default, allows
both formats.

Errors
Errare humanum est. Therefore, we must handle errors the best way we can. Because Lua is an extension
language, frequently embedded in an application, it cannot simply crash or exit when an error happens.
Instead, whenever an error occurs, Lua must offer ways to handle it.

Any unexpected condition that Lua encounters raises an error. Errors occur when a program tries to add
values that are not numbers, call values that are not functions, index values that are not tables, and so on.
(We can modify this behavior using metatables, as we will see later.) We can also explicitly raise an error
calling the function error, with an error message as an argument. Usually, this function is the appropriate
way to signal errors in our code:

print "enter a number:"


n = io.read("n")
if not n then error("invalid input") end

This construction of calling error subject to some condition is so common that Lua has a built-in function
just for this job, called assert:

print "enter a number:"


n = assert(io.read("*n"), "invalid input")

The function assert checks whether its first argument is not false and simply returns this argument;
if the argument is false, assert raises an error. Its second argument, the message, is optional. Beware,
however, that assert is a regular function. As such, Lua always evaluates its arguments before calling
the function. If we write something like

n = io.read()
assert(tonumber(n), "invalid input: " .. n .. " is not a number")

Lua will always do the concatenation, even when n is a number. It may be wiser to use an explicit test
in such cases.

When a function finds an unexpected situation (an exception), it can assume two basic behaviors: it can
return an error code (typically nil or false) or it can raise an error, calling error. There are no fixed
rules for choosing between these two options, but I use the following guideline: an exception that is easily
avoided should raise an error; otherwise, it should return an error code.

126

You might also like