File Searching Speed
Torana over Shantipur, next to Svayambhu stupa, Kathmandu

Noticing that some of my colleagues get results for their grep searches almost instantly, I began to wonder why my searches of all my Sanskrit etexts took close to 2 minutes. They run grep from the command line (Terminal on OS X) and attributed the speed to that. Not wanting to give up my one-stop application BBEdit, where I can edit the files a search returns right then and there, I ran a test with a colleague who had a similar machine and etext collection. The same search finished on his machine in just 20 seconds, while mine took almost 2 minutes.

He attributed the difference to having converted most of his files to Unix line breaks. Having no luck with batch-converter applications, I realized I would have to go through every folder individually and convert on a smaller scale. This gave me the chance to see what was there and clean out non-text files: I moved several hundred megabytes of web archives and PDFs out of the etext collection, reducing the number of files and the size of the collection by about 25%. In the process I gave everything a .txt extension, where before there had been a plethora of extensionless files and files with many different extensions.

I haven't gotten around to converting the line breaks yet, but I can now search all of my Sanskrit etexts in around 20 seconds, with the results instantly ready for editing in BBEdit's results window. This is a huge improvement: before, I often limited searches to specific folders to keep the speed within reason, whereas now I can search freely.

UPDATE 2021: processors and disks have gotten faster, and so has grep. I now prefer “ripgrep” to search my whole archive of etexts from the command line in less than 1 second.
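For comparison, a hedged sketch of what such a recursive search looks like; the directory name and search term here are illustrative stand-ins, not my actual data.

```shell
# Demo data standing in for the real etext collection.
mkdir -p etexts
printf 'namo buddhaya\n' > etexts/a.txt
printf 'namo dharmaya\n' > etexts/b.txt

# Plain grep: recursive, case-insensitive, limited to .txt files,
# listing only the names of matching files.
grep -ril --include='*.txt' 'namo' etexts | sort

# The ripgrep equivalent (ripgrep skips binary files and honors
# .gitignore by default, part of why it is so fast):
#   rg -il 'namo' etexts
```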