Pete's UNIX/Linux Shell Cookbook

Filtering lists of files using find

List all files in directory foobar not owned by pwb, using a long format:

find foobar/ -user pwb -o -print0 | xargs -0 ls -ld

This illustrates a typical use of find with xargs. find writes the names of the files it matches to standard output, which is piped into xargs; xargs turns its standard input into command-line arguments for the given command. Here the -0 switch tells xargs to expect arguments delimited by null characters, instead of splitting on whitespace and interpreting quotes and backslashes, and -print0 (instead of -print) tells find to produce output in that format. This preserves characters in filenames, such as whitespace, that you would otherwise have to escape using an awk or perl script.
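To see what the null delimiter buys you, the following sketch counts the arguments xargs produces with and without it. The scratch directory and filenames are made up for illustration, and it assumes a find and xargs that understand -print0 and -0 (the GNU and BSD versions do) and a temporary directory whose own path contains no whitespace.

```shell
#!/bin/sh
# Scratch directory holding one ordinary name and one containing a space
# (both names are made up for this illustration).
dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir/two words.txt"

# Default parsing: xargs splits "two words.txt" at the space into two arguments.
unsafe_count=$(find "$dir" -type f -print | xargs -n 1 echo | wc -l | tr -d ' ')

# Null-delimited: each filename arrives as exactly one argument.
safe_count=$(find "$dir" -type f -print0 | xargs -0 -n 1 echo | wc -l | tr -d ' ')

echo "arguments without -0: $unsafe_count, with -0: $safe_count"
rm -rf "$dir"
```

With two files, the first count comes out one higher than the second, because the name containing a space has been split in two.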

Note also the use of -o, meaning "or". find uses short-circuit evaluation, so the expression on the right (the action -print0, in this case) is evaluated only if the left side is false, which effectively negates the condition.
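The same selection can also be written with find's explicit negation operator, !. The sketch below checks that the two forms agree; it uses a -name test rather than -user so that it can be run without a second user account, and the filenames are hypothetical.

```shell
#!/bin/sh
dir=$(mktemp -d)
touch "$dir/a.log" "$dir/b.txt" "$dir/c.txt"

# "test -o action": the action runs only when the test is false.
with_or=$(find "$dir" -name '*.log' -o -print | sort)

# The same selection written as an explicit negation.
with_not=$(find "$dir" ! -name '*.log' -print | sort)

[ "$with_or" = "$with_not" ] && echo "both forms list the same files"
rm -rf "$dir"
```

Which form to use is a matter of taste; the ! form may be easier to read when the expression grows longer.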

Another way to do this would be to use the -exec argument to find:

find foobar/ -user pwb -o -exec ls -ld {} \;

(Note that the ; at the end needs to be protected from the shell by escaping it.) This is subtly different from the above: instead of writing the list to standard output and feeding the entire list to the command, find executes the command once for each filename, substituting the filename for {}. This may cause different output with some commands, for example ls -l (which will align columns differently) and sort -m (which will break completely). Also, the syntax required to do this is less uniform than using xargs, which could make the pipeline hard to understand.
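find also accepts a + in place of the escaped ; after {}, which appends filenames to the command in batches, much as xargs does. As a rough illustration, this sketch counts how many times the command is started under each form; the three filenames are made up, and the helper command just prints one line per invocation.

```shell
#!/bin/sh
dir=$(mktemp -d)
touch "$dir/one" "$dir/two" "$dir/three"

# \; form: the command runs once per file, so "run" is printed three times.
per_file=$(find "$dir" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# + form: filenames are appended in batches, so for a short list like this
# the command runs only once.
batched=$(find "$dir" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

echo "runs with the ; form: $per_file, with the + form: $batched"
rm -rf "$dir"
```

With the + form, ls -l sees all the files at once and aligns its columns as it does in the xargs pipeline.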

To let your file lists work correctly with line-based tools like grep, it's strongly recommended not to put newline characters in your filenames. For that matter, it's recommended to use only non-special, non-whitespace printable characters in your filenames. That is, don't use any of !"${}[]; , and also avoid names like if, do, done etc. that the shell might try to interpret as special words. Your shell's manual page can tell you what these are (try man bash). And only if you want to be very evil should you create a file called "-rf" (a filename cannot contain a /, so "-rf /" itself is mercifully impossible), which you can delete only with extreme care :)
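Names that begin with a dash are awkward because most commands read them as options. Two common ways around this are to prefix the name with ./ or to put -- before it to mark the end of the options. The sketch below tries both in a scratch directory; the filenames are chosen purely for illustration.

```shell
#!/bin/sh
dir=$(mktemp -d)
# Writing the names with a directory prefix stops touch from reading them as options.
touch "$dir/-rf" "$dir/-n" "$dir/keep.txt"

(
    cd "$dir" || exit 1
    rm ./-rf    # with a leading ./ the argument no longer starts with a dash
    rm -- -n    # "--" ends option parsing, so -n is taken as a filename
)

remaining=$(ls "$dir")
echo "left in directory: $remaining"
rm -rf "$dir"
```

Only keep.txt should remain, since each rm call removes exactly the one file it names.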

Remember that, by default, find does not follow symbolic links, partly in order to avoid directory loops. The -L option makes it follow them.
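A small sketch of how find treats a symbolic link to a directory, with and without the -L option. The directory layout is invented for the example.

```shell
#!/bin/sh
dir=$(mktemp -d)
mkdir "$dir/top" "$dir/elsewhere"
touch "$dir/elsewhere/hidden.txt"
ln -s "$dir/elsewhere" "$dir/top/link"

# Default: the link is seen as a link and is not descended into.
default_count=$(find "$dir/top" -type f | wc -l | tr -d ' ')

# -L: links are followed, so the file behind the link is found.
follow_count=$(find -L "$dir/top" -type f | wc -l | tr -d ' ')

echo "files found by default: $default_count, with -L: $follow_count"
rm -rf "$dir"
```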

Sorting tabulated data using cut and sort

List the real names of all users using /bin/csh as their shell, in alphabetical order:

grep ":/bin/csh$" < /etc/passwd | cut -d: -f5 | cut -d, -f1 | sort -d

The /etc/passwd file is a table in which each row is on its own line and the 7 columns are separated by colons. The fields we are interested in here are the login shell and the real name, respectively the 7th (final) and 5th fields. The 5th field usually holds the user's GECOS information: the real name followed by other details (office, phone number etc.), with the subfields separated by commas. We are only interested in the real name here (the first subfield).

This pipeline starts with a simple grep, which is only possible here because we know where the information we want will be (i.e. at the end). grep simply looks for lines in its standard input which match the regular expression it's given, and prints the lines that match. We then filter these lines through cut, specifying : as the field delimiter (using -d), to remove all but the 5th field (the GECOS info - using -f), and then again with , as the field delimiter to cut out all the GECOS info except the real name. We then sort it by dictionary order (-d).
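To try the pipeline without touching the real password file, you can run it against a small sample. The three entries below are invented, and the file is given to grep as an argument rather than on standard input, which makes no difference to the result.

```shell
#!/bin/sh
# A made-up passwd-style file: login:password:uid:gid:GECOS:home:shell
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
zed:x:1002:50:Zed Young,Room 2,555-0102:/home/zed:/bin/csh
amy:x:1001:100:Amy Archer,Room 1,555-0101:/home/amy:/bin/bash
bob:x:1003:50:Bob Brown:/home/bob:/bin/csh
EOF

# Keep csh users, take the GECOS field, keep the part before the first comma, sort.
csh_users=$(grep ':/bin/csh$' "$passwd_sample" | cut -d: -f5 | cut -d, -f1 | sort -d)

printf '%s\n' "$csh_users"
rm -f "$passwd_sample"
```

Note that the entry for bob has no commas in its GECOS field; cut -d, -f1 passes such lines through unchanged, so the name still appears.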

Using awk

A more general approach, for when grep would be awkward (e.g. because the information you need is somewhere in the middle of the line), is to use awk. The following example lists the real names of all users whose primary group is staff (gid 50 on some systems; check /etc/group for yours):

awk -F: '$4 ~ /^50$/ {print $5}' < /etc/passwd | cut -d, -f1 | sort -d

(The 4th password field contains the user's primary group id.) awk is a scripting language designed specifically for processing textual data organised in records with fields delimited by a single character, as the password and group files are. (By default it splits fields on arbitrary sequences of spaces and tabs, as needed for /etc/fstab.) awk automatically creates variables numbered $1 to $NF (where NF is a variable containing the number of fields in the current line) containing each of the fields in the line, and $0 contains the whole line. The -F option specifies a different field separator for input (in this case a colon). $4 ~ /^50$/ is a condition, meaning "field 4 ($4) matches (~) the regular expression /^50$/", and is followed by an action in braces, to be executed when the condition holds. In this case we just print the 5th (GECOS) field. Then we cut out the other GECOS information and sort the list, as before.
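As a variation, awk can also do the work of the second cut itself, using its split function to break the GECOS field at the commas. The sketch below does this on an invented sample file and uses a numeric comparison ($4 == 50) in place of the anchored regular expression; both select the same lines here.

```shell
#!/bin/sh
# A made-up passwd-style file; carol's gid of 500 must not be mistaken for 50.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
zed:x:1002:50:Zed Young,Room 2,555-0102:/home/zed:/bin/csh
amy:x:1001:50:Amy Archer,Room 1,555-0101:/home/amy:/bin/bash
carol:x:1004:500:Carol Clark:/home/carol:/bin/sh
EOF

# For lines whose 4th field equals 50, split the 5th field at commas
# and print the first piece (the real name).
staff=$(awk -F: '$4 == 50 { split($5, gecos, ","); print gecos[1] }' "$passwd_sample" | sort -d)

printf '%s\n' "$staff"
rm -f "$passwd_sample"
```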

Now suppose you want to preserve the whole passwd line. You can have awk swap around some of the fields just by reassigning the $N variables, like this:

awk -F: -v OFS=: '{tmp=$1; $1=$3; $3=tmp; print}' < /etc/yp.passwd | sort -g | awk -F: -v OFS=: '{tmp=$1; $1=$3; $3=tmp; print}'

This swaps the 1st and 3rd fields before filtering through sort -g (general numeric sort), so we are sorting on user ID instead of username. The same script is used to swap them back before outputting. Since we need to output the line with the same delimiter, we need to set the OFS (output field separator) variable to :, which can be done with the -v option.
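If all you need is the sorted output, sort can also be pointed at the 3rd field directly with its -t (field separator) and -k (key) options, with no swapping. The sketch below compares the two approaches on an invented sample file; it uses sort -n rather than sort -g in the swapping version, on the assumption that plain numeric sorting is enough for user IDs.

```shell
#!/bin/sh
# A made-up passwd-style file with distinct user IDs.
passwd_sample=$(mktemp)
cat > "$passwd_sample" <<'EOF'
zed:x:1002:50:Zed Young:/home/zed:/bin/csh
amy:x:1001:100:Amy Archer:/home/amy:/bin/bash
bob:x:1003:50:Bob Brown:/home/bob:/bin/csh
EOF

# Sort numerically on field 3 only, using : as the field separator.
by_key=$(sort -t: -k3,3n "$passwd_sample")

# The swap, sort, swap-back approach described above.
swap='{tmp=$1; $1=$3; $3=tmp; print}'
by_swap=$(awk -F: -v OFS=: "$swap" "$passwd_sample" | sort -n | awk -F: -v OFS=: "$swap")

printf '%s\n' "$by_key"
[ "$by_key" = "$by_swap" ] && echo "both approaches agree"
rm -f "$passwd_sample"
```

The swapping technique is still worth knowing for tools that can only work on the start of each line.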