Short preface: at a job interview, Zach Cox was asked to aggregate words and word counts from a bunch of files into two output files, one sorted alphabetically and one by word count, which he did in Ruby and Scala. This led Lau Bjørn Jensen to do the same thing in Clojure, which apparently sparked other people to do it in Java, Python, etc.
Inspired by the aforementioned problem, and an extended train ride home (thank you, Danish National Railways!!), I decided to see what a C# (v. 3) version could look like:
namespace NewsReader
{
    using System;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Diagnostics;

    class Program
    {
        static void Main()
        {
            const string dir = @"c:\temp\20_newsgroups";

            var stopwatch = Stopwatch.StartNew();

            var regex = new Regex(@"\w+", RegexOptions.Compiled);

            var list = (from filename in Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
                        from match in regex.Matches(File.ReadAllText(filename).ToLower()).Cast<Match>()
                        let word = match.Value
                        group word by word into aggregate
                        select new
                        {
                            Word = aggregate.Key,
                            Count = aggregate.Count(),
                            Text = string.Format("{0}\t{1}", aggregate.Key, aggregate.Count())
                        })
                        .ToList();

            File.WriteAllLines(@"words-by-count.txt", list.OrderBy(c => c.Count).Select(c => c.Text).ToArray());
            File.WriteAllLines(@"words-by-word.txt", list.OrderBy(c => c.Word).Select(c => c.Text).ToArray());

            Console.WriteLine("Elapsed: {0:0.0} seconds", stopwatch.Elapsed.TotalSeconds);
        }
    }
}
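For readers less used to LINQ query expressions: the compiler translates the query above into ordinary extension-method calls. A rough method-syntax equivalent (an illustrative sketch, reusing the same dir and regex locals as in the listing, not the exact compiler output) would look like this:

            // Same pipeline, written with extension methods instead of query syntax:
            // flatten all matches from all files, group by word, project word/count/text.
            var list = Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
                .SelectMany(filename => regex.Matches(File.ReadAllText(filename).ToLower()).Cast<Match>())
                .Select(match => match.Value)
                .GroupBy(word => word)
                .Select(aggregate => new
                {
                    Word = aggregate.Key,
                    Count = aggregate.Count(),
                    Text = string.Format("{0}\t{1}", aggregate.Key, aggregate.Count())
                })
                .ToList();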
Weighing in at 36 lines and executing in 10.2 seconds (on my Intel Core 2 laptop with 4 GB of RAM), this strikes me as a pretty clear and performant alternative to the versions in the other languages mentioned.
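If the run time mattered more, one obvious variation (not benchmarked here, and it requires .NET 4's PLINQ rather than the .NET 3.5 this post targets) would be to spread the per-file reading and matching across cores with AsParallel():

            // Hypothetical PLINQ variant (.NET 4+): parallelize the file processing.
            var list = Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
                .AsParallel()
                .SelectMany(filename => regex.Matches(File.ReadAllText(filename).ToLower()).Cast<Match>())
                .GroupBy(match => match.Value)
                .Select(aggregate => new
                {
                    Word = aggregate.Key,
                    Count = aggregate.Count(),
                    Text = string.Format("{0}\t{1}", aggregate.Key, aggregate.Count())
                })
                .ToList();

Whether that actually helps depends on how much of the 10.2 seconds is disk I/O rather than CPU work, so treat it as a sketch, not a measured result.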
Beautiful!
The LINQ code looks very interesting to a non-Microsofter…