Python: find files and search inside them (find & grep)

Introduction

This is the second article in my series on finding files, grepping them and in-place search/substitute. It presents the scriptutil.ffindgrep() function, which not only helps you find files but also lets you search inside them.

Test data

To play with the scriptutil.py module (syntax highlighted code here), I will use the same test data as in my previous article, i.e. the following example directory tree:

bbox33:scriptutil $ find .
.
./a
./a/a.txt
./a/b
./a/b/b.txt
./a/b/c
./a/b/c/c.txt
./all.doc
./d
./d/d.txt
./d/e
./d/e/e.txt
./o
./o/o.txt
./o/p
./o/p/p.txt
./o/p/q
./o/p/q/q.txt
./o/p/q/r
./o/p/q/r/r.txt
./o/p/q/r/s
./o/p/q/r/s/s.txt

The text files in the tree above were populated with (random) content using the fortune program. The complete test data set may be viewed here.

Find & grep

Now let’s explore the scriptutil.ffindgrep() function and look at examples of how it can be put to good use.

  1 bbox33:scriptutil $ python
  2 Python 2.4.4 (#1, May  9 2007, 11:05:23)
  3 [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
  4 Type "help", "copyright", "credits" or "license" for more information.
  5  >>> import scriptutil as SU
  6  >>> import re
  7  >>> flist = SU.ffindgrep('.',
  8  ...                      namefs=(lambda s: s.endswith('.txt'),),
  9  ...                      regexl=('there',))
 10  >>> flist
  11 {'./a/a.txt': "\tIt's the only even prime, therefore it's odd.  QED.",
  12  './o/p/q/q.txt': 'there."'}
  13  >>> SU.printr(flist)
  14         It's the only even prime, therefore it's odd.  QED.
 15 ./a/a.txt
 16 there."
 17 ./o/p/q/q.txt

On lines 7-9 (above) the scriptutil.ffindgrep() function is invoked with the following parameters:

  1. line 7: the path of the directory tree to be searched ('.')
  2. line 8: a tuple with functions (namefs) to use for filtering the files we want (in this instance just a single function that makes sure we’re looking at text files only)
  3. line 9: a tuple with regular expressions (regexl) to filter the contents of the files that passed the name tests (in this example just a single regex picking lines containing the string ‘there’)

The function’s return value is stored in the flist dictionary, whose content is shown on lines 11-12. On line 13 the scriptutil.printr() helper function is invoked to pretty-print the find results (lines 14-17).
As you can see, two text files were found to contain lines with the string of interest.

Please note also how the scriptutil.ffindgrep() function returned a dictionary where each

  • key is the file name
  • value is a string with all the lines found

What if we wanted to look for lines that contain the string ‘there’ as above, but in a case-insensitive way?

This is precisely what I am doing in the example below.

 18  >>> flist = SU.ffindgrep('.',
 19  ...                      namefs=(lambda s: s.endswith('.txt'),),
 20  ...                      regexl=(('there', re.I),))
 21  >>> flist
 22 {'./o/p/q/q.txt': 'there."',
 23  './a/a.txt': "\tIt's the only even prime, therefore it's odd.  QED.",
 24  './d/e/e.txt': '\tThere are never enough hours in a day, but always too many days'}
 25  >>> SU.printr(flist)
 26         It's the only even prime, therefore it's odd.  QED.
 27 ./a/a.txt
 28         There are never enough hours in a day, but always too many days
 29 ./d/e/e.txt
 30 there."
 31 ./o/p/q/q.txt

Please note how the regexl parameter on line 20 now contains a 2-tuple with a regex definition string ('there') and a regex compilation flag (re.I) respectively.

Because we are now ignoring letter case, an additional match is found and displayed on lines 28-29 above.
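One plausible way to handle a regexl entry that is either a plain string or a tuple is to unpack the tuple straight into re.compile(). The helper below is a sketch under that assumption; compile_spec is a hypothetical name and not part of scriptutil.

```python
import re


def compile_spec(spec):
    """Compile a pattern given as 'regex' or as ('regex', flags).

    Hypothetical helper for illustration; not scriptutil's actual code.
    """
    if isinstance(spec, str):
        return re.compile(spec)
    # A tuple is unpacked into the positional arguments of re.compile().
    return re.compile(*spec)


case_sensitive = compile_spec('there')
ignore_case = compile_spec(('there', re.I))

line = '\tThere are never enough hours in a day, but always too many days'
print(bool(case_sensitive.search(line)))  # False: only 'There' with a capital T
print(bool(ignore_case.search(line)))     # True: re.I ignores letter case
```

This mirrors the difference between the first two examples: the line from ./d/e/e.txt only shows up once re.I is supplied.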

What follows is a slightly more advanced example since it uses more than one match pattern (see line 34).
Because a line must now also satisfy a second regular expression ('eve'), the match shown on lines 30-31 above is gone.

 32  >>> flist = SU.ffindgrep('.',
 33  ...                      namefs=(lambda s: s.endswith('.txt'),),
 34  ...                      regexl=(('there', re.I), 'eve'))
 35  >>> flist
 36 {'./a/a.txt': "\tIt's the only even prime, therefore it's odd.  QED.",
 37  './d/e/e.txt': '\tThere are never enough hours in a day, but always too many days'}
 38  >>> SU.printr(flist)
 39         It's the only even prime, therefore it's odd.  QED.
 40 ./a/a.txt
 41         There are never enough hours in a day, but always too many days
 42 ./d/e/e.txt
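The effect of adding a second pattern can be reproduced without touching the file system. The sketch below applies both patterns to the three lines that matched the case-insensitive search; it only illustrates the "every pattern must match" rule and makes no claim about how scriptutil implements it.

```python
import re

# The three lines matched by the case-insensitive search shown earlier.
lines = [
    "\tIt's the only even prime, therefore it's odd.  QED.",
    "\tThere are never enough hours in a day, but always too many days",
    'there."',
]

# A line is kept only when every pattern is found in it.
patterns = [re.compile('there', re.I), re.compile('eve')]
kept = [line for line in lines if all(p.search(line) for p in patterns)]

for line in kept:
    print(line)
```

The first line survives thanks to "even", the second thanks to "never", and the third is dropped because it contains no 'eve'.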

Please note:

  • The regexl parameter (e.g. on line 34 above) may contain either a simple string (with a regex definition) or a tuple (with parameters accepted by re.compile()).
  • The following regex compilation flags will not have any effect (since the scriptutil.ffindgrep() function matches on a line by line basis): re.S, re.M
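The second note is easy to check with the re module alone. The snippet below is a small experiment, assuming each line is matched with its trailing newline already stripped: the flags change the result on a multi-line string but make no difference once the text is split into lines.

```python
import re

text = "first line\nsecond line"

# On the whole text the flags matter.
multi_plain = bool(re.search(r"^second", text))         # ^ matches only at the very start
multi_flag = bool(re.search(r"^second", text, re.M))    # ^ also matches after a newline
dot_plain = bool(re.search(r"line.second", text))       # . does not match the newline
dot_flag = bool(re.search(r"line.second", text, re.S))  # . matches the newline too

print(multi_plain, multi_flag, dot_plain, dot_flag)  # False True False True

# Line by line there is no newline left, so the flags change nothing.
same = all(
    bool(re.search(rx, line)) == bool(re.search(rx, line, flag))
    for line in text.splitlines()
    for rx, flag in ((r"^second", re.M), (r"line.second", re.S))
)
print(same)  # True
```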

Last but not least, I would like to show an example with multiple file name and file content filters: the second function in the namefs tuple (see line 44) now rules out any files with the letter ‘a’ in their path.

 43  >>> flist = SU.ffindgrep('.',
 44  ...             namefs=(lambda s: s.endswith('.txt'), lambda s: 'a' not in s),
 45  ...             regexl=(('there', re.I), 'eve'))
 46  >>> flist
 47 {'./d/e/e.txt': '\tThere are never enough hours in a day, but always too many days'}
 48  >>> SU.printr(flist)
 49         There are never enough hours in a day, but always too many days
 50 ./d/e/e.txt

Since files with the letter ‘a’ in their path are no longer accepted, the match shown for the text file ./a/a.txt on lines 39-40 above has disappeared.
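The name filters combine the same way as the regular expressions: a path is accepted only if every function in the tuple returns a true value. Here is a small sketch of that rule applied to a few paths from the test tree; accepts is a hypothetical helper name, not something scriptutil exports.

```python
def accepts(path, name_filters):
    """Return True if every filter function accepts the path.

    Hypothetical helper for illustration; not scriptutil's actual code.
    """
    return all(check(path) for check in name_filters)


name_filters = (lambda s: s.endswith('.txt'), lambda s: 'a' not in s)
paths = ['./a/a.txt', './all.doc', './d/e/e.txt', './o/p/q/q.txt']

selected = [p for p in paths if accepts(p, name_filters)]
print(selected)  # ['./d/e/e.txt', './o/p/q/q.txt']
```

Note that ./o/p/q/q.txt passes the name filters here; in the example above it is missing from the result only because none of its lines satisfies the 'eve' pattern.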

I hope you liked this introduction to the scriptutil.ffindgrep() function and will find the latter to be a worthy addition to your Python toolchest.

Outlook

In the next article I will present the scriptutil.freplace() function, which not only helps you find files but also lets you search and replace strings inside them.

7 thoughts on “Python: find files and search inside them (find & grep)”

  1. Very nifty module!

    I find the syntax for the ‘namefs’ argument a bit unwieldy (though very powerful). How about allowing shell-glob-style pattern strings to be passed as well/instead?

    Example:

    flist = SU.ffindgrep('.', nameglobs=('*.txt', 'README*'), regexl=('there',))

    Chris

  2. Seconded. I love my shell globs, but I love python *way* more than shell scripting. If you don’t like that style syntax, could you maybe explain what you prefer about your method?

  3. Cristopher, I can see where you are coming from. I will see how much of an effort it would be to revise the code to support shell-glob patterns as well.

  4. @Shawn In most cases you will want to select filenames based on substring matching – shell-glob patterns are a short and sweet syntax for this. And you don’t need to define a function or write a lambda expression to do that. Also, as muharem pointed out, this will be *very* easy to add with the fnmatch module.

    But the possibility to specify a matching function, although a bit cumbersome for common simple cases, is very powerful. It would be very easy to match filenames against a regular expression or look for numerical strings in the filename and match only a certain number range, and so on…

    “Make simple things simple and complex things possible.”
