Tuesday, August 21, 2007

Sorting and Searching

In this month's post-summer holiday column I cover two important topics for Makefile builders with a round up of techniques for sorting and searching inside Makefiles and on the file system.

Sorting a list


Sorting in a Makefile is relatively simple: GNU Make provides a $(sort) function to sort a list. But there are a few of gotchas:

1. The $(sort) both sorts into order and removes duplicates from a list. There's no option to just sort while retaining duplicates.

For example, sorting a b a a c with $(sort) returns a list with three elements (and just one a):

$(warning $(sort a b a a c))

which outputs:

Makefile:1: a b c

More on this in the next section.

2. The actual sort order is defined by the return value from the strcmp C function, which in turn will depend on your operating system and locale collation order. This means that you can't rely on the sort order being identical if the Makefile is used on different operating systems (or even different instances of the same operating system).

3. The sort order is always lexical and there are no options to ignore case, or define another ordering (such as numeric).

If GNU Make's built-in $(sort) function isn't up to the list sorting task that you need then the solution is to use the shell's sort program and call it through $(shell). To use the shell's sort program the list first needs to be broken up into a sequence of lines, then it is passed to sort with the appropriate options, and finally the list is reconstructed. Happily $(shell) will automatically do the list reconstruction for us because the output of sort will be flattened by replacing the end of every line with a space.

To split the list into a sequence of lines you can construct a single echo statement with all the spaces in the list replaced by the newline escape sequence.

The function my-sort below performs a sort using the shell sort program:

sp :=
sp += # add space

my-sort = $(shell echo -e $(subst $(sp),'\n',$2) | sort $1 --key=1,1 -)

my-sort has two arguments: the first are any arguments to pass to the shell's sort program; the second is the list to be sorted.

my-sort works by splitting the list by converting each space to a newline. This is done using the GNU Make $(subst) function that will convert spaces (from the $(sp) variable; since it's hard to pass a space as a function argument in GNU Make, the best way is to create a variable containing just a space and use it instead of a space literal) to newline (note that the newline character is single-quoted and that echo has the -e option: the quotes are there because GNU Make will actually escape the backslash with another backslash and the -e means that the newline escape character will be expanded to a literal newline).

Naturally, there are some problems with this: using the shell is probably slower than using a built-in function because of the need to spawn the shell, and you could hit a command-line length limit if the list to be sorted was long.

Since my-sort uses the shell to sort the list there are a number of options that can come in handy (and are not available by the built-in $(sort)):

-n Perform a numeric rather than alphabetic sort
-M Perform a 'month' sort
-r Reverse the sort order
-f Ignore case

So, for example, you can sort the list 1 12 4 6 8 3 4 5 9 10 alphabetically like this (here I use GNU Make's $(warning) function to print out the result of calling my-sort):

NUMBERS := 1 12 4 6 8 3 4 5 9 10
$(warning $(call my-sort,,$(NUMBERS))

which will output (the Makefile:2: part was generated by $(warning); the rest is the actual sorted list):

Makefile:2: 1 10 12 3 4 4 5 6 8 9

Or you could sort numerically with the -n option:

NUMBERS := 1 12 4 6 8 3 4 5 9 10
$(warning $(call my-sort,-n,$(NUMBERS))

which will output

Makefile:2: 1 3 4 4 5 6 8 9 10 12

Or even numerically, but in reverse order:

NUMBERS := 1 12 4 6 8 3 4 5 9 10
$(warning $(call my-sort,-n -r,$(NUMBERS))

which will output:

Makefile:2: 12 10 9 8 6 5 4 4 3 1


Dealing with duplicates

Since $(sort) always removes duplicates it's unsuitable for use if you need to sort a list but preserve duplicated entries. However, my-sort above works just fine for that purpose. If no options are specified in the first argument of my-sort then duplicates are not removed.

For example, you can sort the list a b a a c with my-sort like this:

DUPS := a b a a c
$(warning $(call my-sort,,$(DUPS)))

and the output is:

Makefile:2: a a a b c

In fact, the only way my-sort will remove duplicates is if the -u option is specified:

DUPS := a b a a c
$(warning $(call my-sort,-u,$(DUPS)))

and the output is:

Makefile:2: a b c

Another way to handle list deduplication is with the GNU Make Standard Library. It includes a function called uniq which removes duplicates from a list without sorting it. The first occurrence of each duplicated element is preserved.

For example, with the GNU Make Standard Library included (see http://gmsl.sf.net/ for more details) using include gmsl, the list b a c a a d can be deduplicated while preserving order:

include gmsl

DUPS := b a c a a d
$(warning $(call uniq,$(DUPS)))

which will output:

Makefile:4: b a c d


Searching for a file

Enough sorting, let's look at searching. First, searching for a file. As with sorting there are two options: use a built-in function and spawn a shell and use a shell program.

The built-in file searching function is called $(wildcard). Its argument is a pattern to search for on the file system and it returns a list of files and directories that match the pattern. It uses the same wild card patterns as the Bourne shell and so it's possible to use *, ? and ranges and character classes.

For example, to return a list of all files ending .c in the current directory you would write $(wildcard *.c). To search all subdirectories of the current directory for .c files you can write $(wildcard */*.c) and so on.

$(wildcard) can also be used to find directories. If the pattern ends with / then the function only returns directories that match the pattern, and each directory will be suffixed with /. For example, to get a list of all the directories in the current directory do $(wildcard */).

There are a few of $(wildcard) gotchas though:

1. $(wildcard) does not provide any way to recursively search a tree of directories.

2. $(wildcard) interacts in a surprising way with GNU Make's built-in directory cache mechanism which can mean that $(wildcard) doesn't reflect the actual state of the file system (for more on this see my article The Trouble with $(wildcard): http://www.cmcrossroads.com/content/view/6487/120/).

3. $(wildcard) doesn't work correctly if the pattern has a space in it. That's because all GNU Make functions assume lists split on spaces and have no special escaping or handling of spaces in paths. The work around for this situation is to replace all spaces in the pattern with the ? globbing character and then call $(wildcard):

sp :=
sp += # add space

space-wildcard = $(wildcard $(subst $(sp),?,$1))

space-wildcard takes a single argument which is exactly the same format as $(wildcard) and replaces any spaces with ? before calling the native $(wildcard). Because of this you cannot call space-wildcard with a list of patterns to search.

Another way to search is to use the shell's find command (in a similar manner to the use of sort above). Since find has many more search options than $(wildcard) it's possible to perform much more flexible searches (including recursive descents).

Here's the definition of my-wildcard:

my-wildcard = $(shell find $1 -name '$2' $3 -print;)

It takes three arguments: the first is the directory to start searching in, the second is the pattern to search for, and the last optional argument are any argument to pass directly to find (for example, if you don't want find to recursive the third argument would be -maxdepth 0).

A simple example, shows how my-wildcard can be used to recursively search from the current directory for all .c files.

$(warning $(call my-wildcard,.,*.c))

If you don't need all the power of find then a simpler replacement function for $(wildcard) can be created just using echo:

simple-wildcard = $(shell echo $1)

It only takes one argument: the pattern to search for, but it overcomes the directory cache limit mentioned above and can be used as a direct replacement for $(wildcard):

Here's how simple-wildcard can be used to find all the .c files in the parent directory with exactly the same syntax as $(wildcard):

$(call simple-wildcard,../*.c)


Recursively searching for a file

Although you can use my-wildcard above (which used the shell find function to perform recursive searches) it would be nice to perform them using only built-in GNU Make functions to avoid the cost of starting a shell with $(shell).

It's possible to perform a recursive $(wildcard) using the following definition that only uses built-in GNU Make functions and never calls the shell:

search = $(foreach d,$(wildcard $1/*),$(call search,$d)$(filter $(subst *,%,$2),$d))

This search function has two arguments: the first is the directory to start the recursive descent in; the second is the pattern to search for. The pattern is not as flexible as $(wildcard) and only supports a single * wild card.

It works by first calling $(wildcard) to get a list of all files and directories (the $(wildcard $1/*)) and then for each one it recurses and calls itself. After the recursion returns to the main function the list of files and directories is matched against the pattern using $(filter).

It's here that the 'single * wild card' restriction is created because $(filter) can only accept a single % wild card. Any * present in the second argument to search is converted to % for $(filter) by the $(subst *,%,$2).

Searching a list of directories for a file

Another common type of file searching is 'in which of these directories can I find file X?' And the most common form of that is searching the path for an executable.

It's fairly easy to define a function, let's call it which-dir, that takes two arguments: a list of directories to search and the name of a file to look for. which-dir will search the list of directories and return the full path each time it finds the requested file. Each found file (and path) will be returned in the same order as the paths in the first argument.

which-dirs = $(wildcard $(addsuffix /$2,$1))

It works by adding the name of the file to search for to each element of the path using $(addsuffix). Since $(wildcard) works even if there are no wild card characters present (in which case it acts as a 'file exists' function, returning the name of the file if it is present, or an empty string if not) and $(wildcard) can accept a list of files to look for in which case it searches for each.

A simple use of which-dirs would be to search the PATH for a particular executable. Here's what happens on my machine looking for emacs on the path:

$(warning $(call which-dirs,$(subst :, ,$(PATH)),emacs))

and it outputs:

Makefile:2: /usr/bin/emacs /usr/bin/X11/emacs
.
Notice that since PATH is a :-separated list I first turned it into a space-separated list for which-dirs by writing $(subst :, ,$(PATH)).

If I wanted to know the first place that emacs is found on the path I'd just take the first element of the list returned by which-dirs using the GNU Make built-in function $(firstword):

$(warning $(firstword $(call which-dirs,$(subst :, ,$(PATH)),emacs)))


Searching for a list member

That's enough file searching. How about searching to see if a list contains a specific member? GNU Make provides two functions that can be used for that purpose: $(filter) and $(filter-out).

To see if a list contains a specific string the $(filter) function is used. If the string is present in the list then $(filter) returns the string found, if it is not present the the function returns an empty string.

For example, to search the list a b dog c for the word dog and the word cat you can do:

LIST := a b dog c

$(warning $(filter dog,$(LIST)))
$(warning $(filter cat,$(LIST)))

and the output will look like:

Makefile:3: dog
Makefile:4:

because there is a dog, but no cat.

$(filter) also supports wild cards; well one wild card: %. So to find whether the same list contains any word starting with d you use the pattern d%:

LIST := a b dog c

$(warning $(filter d%,$(LIST)))

and you'll get the following output:

Makefile:3: dog

$(filter-out) performs the opposite of $(filter): it will return an empty string if the pattern is present, and a non-empty string if it is not. For example, to see if the same list does not contain the word cat you use $(filter-out):

LIST := a b dog c

$(warning $(filter-out cat,$(LIST)))

and you'll get the following output:

Makefile:3: a b dog c

Notice that the output here is actually the entire list (because of the way $(filter-out) works). From GNU Make's perspective any non-empty string is 'true' and the empty string is 'false'. The output of both $(filter) and $(filter-out) can be used directly with ifndef or $(if) as a truth-value.

Searching for a sub-string

Lastly, if you are dealing with a string and not a list, you may need to search for a substring. GNU Make has a built-in function $(findstring) for that purpose.

For example, to see if the substring 'og c' appears in the string 'a b dog c' do the following:

STRING := a b dog c

$(warning $(findstring og c,$(STRING)))

which will output

Makefile:3: og c

If the string had not been present the function would have returned an empty string. Notice how $(findstring) is treating the same text as above as a single string, and not a space-separated list.

Conclusion

That's enough sorting and searching for this month. I'll be back in October with more ways to make GNU Make do the things that aren't mentioned in the manual.

0 komentar:

Powered by WebRing.