File Searching Speed

Torana over entrance to Śāntipura, Svayambhunath, Kathmandu

Noticing that some of my colleagues get results for their grep searches almost instantly, I began to wonder why my searches of all my Sanskrit etexts took close to 2 minutes.  They use grep from the command line (Terminal on OS X) and attribute the speed to that.  Not wanting to drop my one-stop application BBEdit, where I can edit the files returned by a search right then and there, I decided to run a test with a colleague who had a similar machine and etext collection.  The same search finished on his machine in just 20 seconds, while mine took almost 2 minutes.  He attributed the difference to having converted most of his files to Unix line breaks.

Not having luck with batch converter applications, I realized I would have to go through every folder individually and batch convert on a smaller scale.  This gave me the chance to see what was there and clean out non-text files: I moved several hundred megabytes of web archives and PDFs out of the etext collection, reducing both the number of files and the size of the collection by about 25%.  In the process I gave everything the extension .txt, whereas before there was a plethora of extensionless files and files with many different types of extensions.

I haven’t gotten to converting the line breaks yet, but I can now search all of my Sanskrit etexts in around 20 seconds and have them ready for editing instantly in BBEdit’s results window.  This is a huge improvement, because now I can search more freely, whereas before I often limited searches to specific folders to keep the speed within reason.

Posted in GREP

4 Responses to File Searching Speed

  1. Haru says:

    Glad to hear that you have been able to speed up searching to such a significant extent; that will indeed make things a lot easier. No doubt such a speed is quite sufficient. It is still possible (though this is perhaps of at most academic interest, since obviously for you BBEdit has several advantages) that using the terminal might enable another jump. I don’t know if you did any comparative tests. On my Linux machine, a command-line grep rarely seems to take more than about a second and a quarter. (Perhaps I am searching a somewhat smaller e-text collection, though.)

  2. Michael says:

    Thanks for the comment, Haru. I have now converted all of the line breaks to Unix, and searching is even faster. In the process I found many files whose encoding was garbled and had to be changed, so I also made about a hundred more files searchable. Now a command-line grep returns results in a few seconds, and BBEdit in around 15 seconds. There are still a large number of duplicates and near-duplicates that will one day have to be fixed by hand. I have tried duplicate-detecting programs, but many of these files differ only in that someone went through and converted / to |. I don’t know why they kept both versions, but the automated duplicate finders have a hard time with differences like that. Some files also have an extra blank line between every line of text, the apparent result of a batch attempt to change all the line-break types. These too are not easily detectable.
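
The two kinds of near-duplicates described here (danda written as / in one copy and | in the other, plus inserted blank lines) could in principle be caught by checksumming a canonicalized copy of each file. This is just a sketch with made-up sample files, not a tool anyone in the thread actually used:

```shell
# Canonicalize (map | to /, drop blank lines), then checksum:
# files differing only in those ways get identical checksums.
dir=$(mktemp -d)
printf 'rama gacchati |\n\n' > "$dir/a.txt"   # danda as |, extra blank line
printf 'rama gacchati /\n'   > "$dir/b.txt"   # danda as /, no blank line

for f in "$dir"/*.txt; do
  sum=$(tr '|' '/' < "$f" | grep -v '^$' | cksum | cut -d' ' -f1)
  echo "$sum  $f"
done | sort        # near-duplicates now share the same first column
```

Sorting on the checksum column groups the candidates; which copy to keep would still be a judgment call made by hand.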

  3. Somdev says:

    There appears to be a genuine (new?) speed problem with grep in the OS X UTF-8 locale, see:

    http://tdas.wordpress.com/2008/02/03/speed-up-grep/

    Changing the locale to C (export LC_ALL=C) did make a big difference on my machine. Terminal grepping (especially fgrep) will evidently always be faster for huge searches. Since I gave up on TeXShop and now use BBEdit with Skim, the terminal is always open for the log files anyway, so it is convenient too. But I would not want to give up the BBEdit search results browser; it is just incredibly user-friendly.
    Saturday, February 27, 2010 – 01:43 AM
    Somdev
    PS: pcregrep appears unaffected by the issue I mentioned above, and since it supports Perl-compatible regular expressions it is actually more powerful than ordinary GNU grep. Also, if you change your locale you will have issues with folder names containing UTF-8 characters. You can install pcregrep quite easily; there is a ready-made installer here (easy):

    http://www.rudix.org/packages-opq.html

    Or get the source from:

    http://www.pcre.org/ and then compile it yourself (tedious).
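
One way to get the C-locale speedup without the folder-name side effect mentioned above is to scope the locale to a single command instead of exporting it; the search term and directory below are placeholders:

```shell
# Setting LC_ALL only for the grep invocation leaves the shell's
# UTF-8 locale (and hence folder-name handling) untouched.
dir=$(mktemp -d)
printf 'tat tvam asi\n' > "$dir/sample.txt"
LC_ALL=C grep -r "tvam" "$dir"
```

The assignment applies to that one process only, so the rest of the session keeps its UTF-8 settings.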

    Another good and very fast search browser is the free Xcode app. Just drag the whole folder of your etexts onto it and use the Find command; it provides a browser just like BBEdit’s, but it seems faster to me.

  4. Do write more about the technical aspects of working with Sanskrit etexts. I myself have thought of writing an article on how to make a Devanagari book in InDesign, but I guess it would be of no value to those who do not understand the subject, and of little value to those who already do it.
