Searching the contents of many files in Linux

One thing that I find remarkable in Linux is the vast array of clever tools that allow you to do ‘clever stuff’.

The problem is that these obscure tools can be difficult to work out how to use effectively and, moreover, correctly.

One task I need to do occasionally is search a disk for files containing particular text. This is always difficult. Even in Windows XP, the search indexing service can actually prevent Windows from finding specific text within a file. I’ve searched for files that I know have certain strings in them, only for Windows to tell me that it cannot find them.

As my love of Linux and Ubuntu grows, I have found myself needing to perform this task again. Recently, I’ve scraped by using the rather useful find tool:

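<code>find . -type f -name '*foobar*'</code>

What this little snippet does is search the current directory and all subdirectories for any files with foobar somewhere in the name. The asterisks are wildcards (without them, find would only match a file called exactly foobar), and the quotes stop the shell from expanding them before find sees them. So it could return names such as ./foobar.doc, ./test/foobar.doc, or ./this is a foobar file name.txt. Pretty useful.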
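To see how -name treats wildcards, here is a small sketch that can be pasted into a terminal. The temporary folder and the file names in it are made up purely for illustration; it compares a bare name with a quoted wildcard pattern.

```shell
# Build a throwaway folder so the example does not touch real files.
demo=$(mktemp -d)
mkdir "$demo/test"
touch "$demo/foobar" "$demo/foobar.doc" "$demo/test/foobar.doc" \
      "$demo/this is a foobar file name.txt" "$demo/unrelated.txt"

# Bare name: matches only a file called exactly "foobar".
exact=$(cd "$demo" && find . -type f -name foobar)

# Wildcards on both sides: matches any name containing "foobar".
# The quotes keep the shell from expanding the pattern itself.
partial=$(cd "$demo" && find . -type f -name '*foobar*' | sort)

printf 'exact:\n%s\n\npartial:\n%s\n' "$exact" "$partial"

# Tidy up the throwaway folder.
rm -rf "$demo"
```

The first search prints only ./foobar, while the second prints all four names that contain foobar and skips unrelated.txt.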

Tonight I needed to search for a specific phrase in a Word document on a disk. This is where the Linux command line really becomes powerful:

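<code>find . -type f -name '*.doc' -print0 | xargs -0 grep -i 'foo bar'</code>

This joins the power of find and grep, with xargs as the glue between them, to create a groovy search.

First of all, the find command is searching for all files (-type f) that end in .doc (-name '*.doc') in all folders starting from the folder I am in. The quotes around *.doc matter: without them, the shell may expand the pattern against the current folder before find ever sees it.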

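Once find has its matches, we use the pipe (the |) to pass the file names over to xargs, which runs grep on them to search each file for the string foo bar.

We have to use the -print0 and -0 options to make sure that find and xargs share the file names correctly between them in case we find any unusual ones (files with spaces would be counted as unusual). The -print0 option makes find end each name with a null character instead of a newline, and -0 tells xargs to split its input on that character, so a name containing spaces arrives at grep in one piece.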
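A quick way to convince yourself that -print0 and -0 matter is to try the search with and without them on a file name containing a space. This is only a sketch under a few assumptions: the folder and the file "my report.doc" are invented, the file is plain text with a .doc name rather than a real Word document, and I have added grep's -l option so that only matching file names are printed.

```shell
# Throwaway folder with one awkwardly named file (plain text, not real Word).
demo=$(mktemp -d)
printf 'some foo bar text\n' > "$demo/my report.doc"

# Without -print0/-0: xargs splits "./my report.doc" at the space, so grep
# is handed two names that do not exist. Errors are hidden with 2>/dev/null.
plain=$(cd "$demo" && find . -type f -name '*.doc' \
        | xargs grep -il 'foo bar' 2>/dev/null || true)

# With -print0/-0: names are separated by null characters, so the space
# inside the name is left alone.
safe=$(cd "$demo" && find . -type f -name '*.doc' -print0 \
       | xargs -0 grep -il 'foo bar')

printf 'without -print0: [%s]\nwith -print0:    [%s]\n' "$plain" "$safe"

# Tidy up the throwaway folder.
rm -rf "$demo"
```

The first result comes back empty because grep never sees the real file name; the second finds ./my report.doc.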

Finally, the -i tells grep that the search is case-insensitive. This means that any .doc file with foo bar, Foo Bar, fOO bAR or any other case variation will be caught. Without it, only the exact string will be matched.
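The effect of -i is easy to check on a few sample files. Again, this is a sketch rather than anything definitive: the folder and the files a.doc, b.doc and c.doc are hypothetical plain-text stand-ins, and -l is my addition so that grep prints only the names of matching files.

```shell
# Throwaway folder with three plain-text stand-ins for documents.
demo=$(mktemp -d)
printf 'Foo Bar\n'      > "$demo/a.doc"
printf 'foo bar\n'      > "$demo/b.doc"
printf 'nothing here\n' > "$demo/c.doc"

# Case-sensitive search: only the exact lowercase string matches.
sensitive=$(cd "$demo" && find . -type f -name '*.doc' -print0 \
            | xargs -0 grep -l 'foo bar' | sort)

# Case-insensitive search: -i also catches "Foo Bar".
insensitive=$(cd "$demo" && find . -type f -name '*.doc' -print0 \
              | xargs -0 grep -il 'foo bar' | sort)

printf 'without -i:\n%s\n\nwith -i:\n%s\n' "$sensitive" "$insensitive"

# Tidy up the throwaway folder.
rm -rf "$demo"
```

Without -i only ./b.doc is listed; with -i both ./a.doc and ./b.doc turn up, and c.doc is ignored either way.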

Now go forth and search!