WideFinder II
My entry for WideFinder II has performed a lot better than expected.
WideFinder Results
Ideally you would not have to write any customised code, so you would end up with just the following config file and Groovy report.
Config File
Groovy Report
For performance reasons, though, the actual entry uses an optimised (non-regex) line parser.
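To give a rough idea of what a non-regex line parser can look like, here is a minimal sketch. It is not Kolja's actual parser: the class name SimpleLineParser, the assumption that fields are separated by single spaces, and the sample log line are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Kolja): splits a line on single spaces without regex. */
final class SimpleLineParser {

    /** Returns the space-separated fields of a line, scanning with indexOf instead of a regex. */
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = line.indexOf(' ', start)) >= 0) {
            out.add(line.substring(start, end));
            start = end + 1;
        }
        out.add(line.substring(start)); // last field, or the whole line if it has no space
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample line, assumed to resemble a typical web access log entry.
        String line = "10.0.0.1 - - [01/Jan/2008:00:00:00 -0800] "
                + "\"GET /index.html HTTP/1.1\" 200 1234";
        // Under the single-space assumption the requested URL is field 6.
        System.out.println(fields(line).get(6));
    }
}
```

The point of the design is that indexOf and substring do a single pass over the line with no pattern matching, which is usually cheaper than a general regex when the format is fixed.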
Customised Config
Java Report
With this combination, it processed the 42GB dataset in 13 minutes 26 seconds, which I am happy enough with to leave it at that. I think I could get it below 10 minutes, because currently the 32 threads all seem to be either doing IO or processing a chunk of the file at the same time, rather than overlapping the two.
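One possible way to overlap IO and processing, which I have not tried in Kolja, is to have a reader hand chunks to worker threads through a bounded queue. The sketch below is only an illustration of that idea: the class OverlapSketch is hypothetical, in-memory strings stand in for chunks read from the file, and counting newlines stands in for the real per-chunk work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: a reader feeds chunks to workers via a bounded queue. */
final class OverlapSketch {
    // Sentinel telling a worker to stop; compared by identity, never by content.
    private static final String POISON = new String("EOF");

    /** Counts newline characters across all chunks using the given number of workers. */
    static long countLines(List<String> chunks, int workers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(4);
        AtomicLong total = new AtomicLong();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String chunk = queue.take();
                        if (chunk == POISON) {
                            return;
                        }
                        // Stand-in for the real processing of a chunk.
                        total.addAndGet(chunk.chars().filter(c -> c == '\n').count());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        try {
            // The calling thread plays the reader: while it fetches the next chunk,
            // the workers are busy with the chunks already queued.
            for (String chunk : chunks) {
                queue.put(chunk);
            }
            for (int i = 0; i < workers; i++) {
                queue.put(POISON); // one stop signal per worker
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while feeding workers", e);
        }
        return total.get();
    }
}
```

The bounded queue keeps the reader from running too far ahead of the workers, so memory use stays limited while the disk and the CPUs are both kept busy.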
I don't expect Kolja to ever beat a custom-designed low-level approach, since Kolja is a generalised tool and so does a lot of extra work.
However, the key advantage of this approach is that once you have written the config file, you can do any of the following:
- View the file interactively in a much easier format
- Tail the file
- Run your own specific report
- Run an existing report, e.g. the frequency report on the url field
- Run a report multithreaded, or across machines via GridGain
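As an illustration of what a frequency report on a field such as url boils down to, here is a minimal sketch. It is not Kolja's report code: the class FrequencySketch and its topN method are hypothetical, and the input is assumed to be the already-extracted field values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a frequency report over one extracted field. */
final class FrequencySketch {

    /** Returns the n most frequent values, highest count first, ties broken alphabetically. */
    static List<Map.Entry<String, Integer>> topN(List<String> values, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum); // add one to the running count for this value
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up url values for illustration.
        List<String> urls = List.of("/a", "/b", "/a", "/c", "/b", "/a");
        for (Map.Entry<String, Integer> e : topN(urls, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because the report only ever sees field values, the same counting code would work for any field named in the config file, which is the reuse the list above is getting at.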