grep puzzler

Here’s a puzzler for all you shell-heads (you know who you are). Normal souls, please move along — nothing interesting to see here.

OK. You’re sitting in the parent of “dirname.” Inside dirname and its children are files that you know contain the string “string.” You want a text file listing the names of all those files. You run:

grep -r "string" dirname > dirname/output.txt

One of two things happens:

1) A few seconds later you have the file listing you need.
2) The command runs forever and output.txt grows indefinitely, until you run out of disk space.

As I discovered the hard way, which of these two occurs depends on which version of grep is installed on your system. In 2.5.x, you get outcome #1. Any version prior to that, you get outcome #2. On closer inspection, it’s easy to see what’s happening — grep is greedy, and is scanning the output file even as the shell is appending grep’s results to it. Reading itself and simultaneously reporting into itself. Devilish. Fortunately I spotted my error before I overflowed the drive. And sending output to any location outside of dirname avoids the problem, of course.

But here’s the puzzler: How was this fixed in grep 2.5? grep is not doing the output redirection — the shell is. grep only knows to pass results to stdout. Beyond that is a black box. So how is grep 2.5 able to avoid the problem of infinite recursion? How was it made aware of what the shell is doing? Cue Twilight Zone intro music.

Freaky deaky, super geeky.

Music: Stranglers :: Get A Grip

9 Replies to “grep puzzler”

Sean Graham says:

July 15, 2004 at 7:55 pm

So you are grep. You have a file open, output.txt, which is the current file in your gigantic list of files you are recursing through. This open file is represented by a File Descriptor.

Every C program also has a set of universal file descriptors that are always available for the standard streams (stdin, stdout, stderr).

Using the fstat(2) function we can stat structures from the two file descriptors. By comparing both the stat.st_dev (inode’s device) and stat.st_ino (inode number) values we can determine if the output.txt is the same file on disk as what stdout is currently pointing to.

Make sense? I could send you a code example if you’d like.

Sean Graham says:

July 15, 2004 at 7:58 pm

corrections:

1) “Using the fstat(2) function we can OBTAIN stat structures…”

2) You aren’t comparing the st_dev and st_ino to each other; you are comparing the st_dev values in the two structures with each other, and likewise the st_ino values. :)
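
For anyone who wants to see the comparison spelled out, here is a minimal sketch in Python rather than C, assuming a POSIX-style system. The helper name same_file is made up for illustration, and this is not taken from grep's source; os.fstat is Python's wrapper around fstat(2).

```python
import os


def same_file(fd_a: int, fd_b: int) -> bool:
    """Return True if both descriptors refer to the same file on disk."""
    stat_a = os.fstat(fd_a)
    stat_b = os.fstat(fd_b)
    # Compare device with device and inode with inode, never one with the other.
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)
```

A tool in grep's position could call same_file(1, fd_of_current_input) before reading each file, since descriptor 1 is stdout.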

Scot Hacker says:

July 15, 2004 at 11:34 pm

OK, in English? :)

Here’s what I think I understand from that, correct me if this is wrong: grep isn’t aware of the file it’s outputting to, because it only knows that it’s sending to stdout. But it *is* aware of the same file as one of the items it’s scanning (i.e. it knows much more about its input than its output, as far as the filesystem is concerned). And in grep 2.5, this awareness must have been expanded to check whether any of the files in the input array were changing underneath it (i.e. whether the inode set was changing as grep is running).

Is that about right? Makes sense when you think about it that way. OK, uncue the Twilight Zone…

Sean M. Graham says:

July 16, 2004 at 7:35 am

I think you are on the right track. What is happening is that stdout is treated just as any other file. So when you redirect stdout on the command line, you are really just telling the OS to hook up stdout to that file as opposed to the console.

Since it is treated just as another file, you can use the fstat() function to get status information for each of the files. By looking at the inode number for stdout and comparing it to the inode number of the file we are currently grepping, we can determine if they are the same file.
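
To make that concrete, here is a rough Python sketch of a recursive search that skips whatever file its output is connected to. It is only an illustration of the idea, not how grep itself is written; the function name grep_tree and its parameters are invented for this example, and it assumes every file under the directory is a readable regular file.

```python
import os
from typing import TextIO


def grep_tree(root: str, needle: str, out: TextIO) -> None:
    """Write 'path:line' to out for each matching line under root, skipping out itself."""
    out_stat = os.fstat(out.fileno())
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "r", errors="replace") as candidate:
                in_stat = os.fstat(candidate.fileno())
                # Same device and same inode means this input is our own output.
                if (in_stat.st_dev, in_stat.st_ino) == (out_stat.st_dev, out_stat.st_ino):
                    continue
                for line in candidate:
                    if needle in line:
                        out.write(f"{path}:{line.rstrip()}\n")
```

Calling grep_tree("dirname", "string", sys.stdout) from a script whose output is redirected into dirname/output.txt would pass over output.txt, because sys.stdout is then just another open file with a device and inode number.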

Make more sense?

Scot Hacker says:

July 16, 2004 at 12:27 pm

Ah… light starts to go on. So my thinking was that grep can’t know what happens to data after it’s passed to stdout. Only the shell has that info, and whether it’s processed by further utilities, redirected to a file, etc. is unknown to grep. But if I understand correctly, you’re saying that stdout actually yields a file descriptor, even though no file is implicitly created just by passing off to stdout. Is that right? And that that file descriptor is passed back up the chain to grep?

I don’t know, this is tough to visualize. I’m so used to thinking of stdout as being the endpoint of a utility’s awareness, and it being the shell, not the utility that takes things from there.

steve jenson says:

July 17, 2004 at 6:44 pm

Here’s my take:

stdout has an FD of 1. Output redirection attaches the file to that FD (unless you specify another FD). grep does know that it’s writing to FD 1 and sees that some file has FD 1 and so doesn’t touch that file.

I believe this is correct. I double-checked the bash manpage and it seems to back me up (grep for “Redirecting Output” if you’re using 10.3).

Scot Hacker says:

July 18, 2004 at 7:17 am

Steve, thanks – odd that Panther’s man page would back this up though, since Panther ships with grep 2.4, where I first encountered the problem.

steve jenson says:

July 18, 2004 at 3:51 pm

Well, the shell only does the FD manipulation, grep still has to do the circular dependency checking to not shoot itself in the foot. I guess grep 2.5 was the first to do so.

Sean M. Graham says:

July 19, 2004 at 6:24 am

“stdout has an FD of 1. Output redirection attaches the file to that FD (unless you specify another FD). grep does know that it’s writing to FD 1 and sees that some file has FD 1 and so doesn’t touch that file.”

Actually, this probably isn’t quite right. When grep uses fopen() to open the file, it is opening the file a second time (the first was when the file was connected to stdout/fd 1), so the stream it gets back will be backed by a different file descriptor. There is no guarantee that multiple calls to fopen() on the same file will use the same file descriptor, which is why comparing descriptor numbers doesn’t help and you have to compare what fstat() reports instead.
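
A small Python sketch shows the difference, assuming a POSIX-style system. The helper name open_twice is hypothetical; os.open and os.fstat are Python's thin wrappers around open(2) and fstat(2).

```python
import os
from typing import Tuple


def open_twice(path: str) -> Tuple[int, int, bool]:
    """Open path twice; return both descriptor numbers and whether they name the same file."""
    fd_first = os.open(path, os.O_RDONLY)
    fd_second = os.open(path, os.O_RDONLY)
    try:
        stat_first = os.fstat(fd_first)
        stat_second = os.fstat(fd_second)
        same = (stat_first.st_dev, stat_first.st_ino) == (stat_second.st_dev, stat_second.st_ino)
        return fd_first, fd_second, same
    finally:
        # Close both descriptors so the sketch does not leak them.
        os.close(fd_first)
        os.close(fd_second)
```

The two descriptor numbers differ because both are open at the same time, yet the device and inode pair matches, and that pair is the thing worth comparing.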
