C# vs. C#
In
The problem takes an archive of newsgroup articles and creates one file containing a list of all unique words with their occurrence count sorted by word and another sorted by occurrence.
Here is Mogens’ C# solution:
class Program
{
    static void Main()
    {
        const string dir = @"c:\temp\20_newsgroups";
        Stopwatch stopwatch = Stopwatch.StartNew();
        var regex = new Regex(@"\w+", RegexOptions.Compiled);

        var list = (from filename in Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
                    from match in regex.Matches(File.ReadAllText(filename).ToLower()).Cast<Match>()
                    let word = match.Value
                    group word by word into aggregate
                    select new
                    {
                        Word = aggregate.Key,
                        Count = aggregate.Count(),
                        Text = string.Format("{0}\t{1}", aggregate.Key, aggregate.Count())
                    })
                   .ToList();

        File.WriteAllLines(@"words-by-count.txt", list.OrderBy(c => c.Count).Select(c => c.Text).ToArray());
        File.WriteAllLines(@"words-by-word.txt", list.OrderBy(c => c.Word).Select(c => c.Text).ToArray());

        Console.WriteLine("Elapsed: {0:0.0} seconds", stopwatch.Elapsed.TotalSeconds);
    }
}
While my lack of familiarity with the languages used in the other examples made it a little more difficult to appreciate their strengths, I felt the C# example provided by Mogens was fairly concise and intuitive by comparison. Nevertheless, I couldn’t help wondering if I might be able to improve it in some way, so I set out to see what I could come up with.
Here are the results:
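// Assumes dir is declared at class level (it is a local constant in the first solution)
// and that a ForEach extension method for IEnumerable<T> is defined elsewhere;
// the base class library has no ForEach extension for arrays or IEnumerable<T>.
static void Solution2()
{
    var regex = new Regex(@"\W+", RegexOptions.Compiled);
    var d = new Dictionary<string, int>();

    // Split each file on runs of non-word characters and tally every token.
    // Note: Regex.Split returns an empty string when the text starts or ends
    // with a separator, so "" is counted as a word here.
    Directory.GetFiles(dir, "*.*", SearchOption.AllDirectories)
        .ForEach(file => regex.Split(File.ReadAllText(file).ToLower())
            .ForEach(s => d[s] = 1 + (d.ContainsKey(s) ? d[s] : 0)));

    File.WriteAllLines(@"words-by-count2.txt",
        d.OrderBy(p => p.Value).Select(p => string.Format("{0}\t{1}", p.Key, p.Value)));
    File.WriteAllLines(@"words-by-word2.txt",
        d.OrderBy(p => p.Key).Select(p => string.Format("{0}\t{1}", p.Key, p.Value)));
}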
The primary differences in this example are the use of a Dictionary to accumulate the frequency count rather than grouping, and the use of Regex.Split rather than Regex.Matches, which avoids the need to cast the resulting collection. Based on my measurements, this approach is approximately 36% faster on average than the first solution, and it is a bit more concise.
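To make the two counting strategies easier to compare side by side, here is a minimal sketch that runs both on a short in-memory string instead of the newsgroup archive. The class name WordCountSketch, the method names, and the sample sentence are made up for illustration and appear in neither solution above. The sketch also skips the empty tokens that Regex.Split can produce and uses a plain foreach loop rather than a ForEach extension, so it is not a drop-in copy of either version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordCountSketch
{
    // Approach 1: collect word tokens with Regex.Matches, then group and count.
    public static Dictionary<string, int> CountByGrouping(string text)
    {
        return Regex.Matches(text.ToLower(), @"\w+")
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Approach 2: split on runs of non-word characters, then tally in a Dictionary.
    public static Dictionary<string, int> CountByDictionary(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in Regex.Split(text.ToLower(), @"\W+"))
        {
            // Split yields "" when the text starts or ends with a separator.
            if (word.Length == 0) continue;

            int n;
            counts.TryGetValue(word, out n); // n stays 0 for an unseen word
            counts[word] = n + 1;
        }
        return counts;
    }

    static void Main()
    {
        const string sample = "The cat saw the other cat.";
        var grouped = CountByGrouping(sample);
        var tallied = CountByDictionary(sample);

        // Print each word with the count from both approaches.
        foreach (var pair in grouped.OrderBy(p => p.Key))
            Console.WriteLine("{0}\t{1}\t{2}", pair.Key, pair.Value, tallied[pair.Key]);
    }
}
```

With the empty tokens filtered out, both methods should report the same counts for any input. The sketch says nothing about relative speed; that would need to be measured on the full archive.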
Overall, I don’t think this example has a varied enough problem domain to really compare the strengths of different languages as some have done, but I found it a fun exercise to see how I might improve the original C# version nonetheless.