This is the inaugural post in a series that we’ll be calling Ruby TMTOWTDI (There’s more than one way to do it). Usually the TMTOWTDI acronym is used as a disparaging term towards flexible languages, which offer a myriad ways of solving most given problems. This series, however, aims to use this property of Ruby as an educational tool.
The format is as follows: we’ll present a problem, then everyone will try and write the most idiomatic, concise code that solves the problem in a readable fashion. This is not meant to be a golfing competition, nor an optimization one. However, we will benchmark the solutions for a rough idea on how different styles affect performance. For every installment we’ll have a performance winner for fastest time and a ’style’ winner for the most elegant code.
This gives us at CitrusByte a better understanding of each other’s coding styles, and stimulates discussions about little tricks we may not be aware of. Best of all, it’s fun as heck to be a part of.
You can get involved too! Leave a comment with your solution if you think you can best the ones listed here. We also encourage you to propose a simple problem (less than 20 lines of code to solve) for the next instalment in the series.
Since this is the first time we’re doing it, we’ve picked a very simple problem to solve as a quick warm-up.
Problem: given a string ’str’ of the format
"a10 b20 c25 d40"
convert it to a hash
{'a' => 10, 'b' => 20, 'c' => 25, 'd' => 40 }
Note that the values of this hash are integers, not strings. If the string is
"a10 a20"
the output should be
{'a' => 20}
.
Leaving some spoiler space in case you want to try it on your own. Below are selected solutions from our team.
Spoiler space
Solutions
Michel
Hash[*str.scan(/(.)(\d{2})/).map { |k, v| [k, v.to_i] }.flatten]
Time taken: 9.99s
Tim
Hash[*(str.scan(/(\w)(\d+)/).map {|k, v| [k, v.to_i] }.flatten)]
Time taken: 9.40s
Tim’s comments: the splat operator has low precedence so the outermost set of
parantheses are redundant, but I felt it’s better to have them there to make it
very obvious what’s going on.
Nicolas (winner: speed award)
str.scan(/\s*(\D+)(\d+)/).inject({}) {|h,(k,v)| h[k] = v.to_i; h }
Time taken: 7.19s
Nicolas’s comments: I like inject/reduce/fold, whatever you call it. It’s
concise and usually a perfect fit for constructing enumerable data structures.
Martin (winner: style award)
str.split.inject({}) {|h,i| h[i[/^\D+/]] = i[/\d+$/].to_i; h }
Time taken: 8.75s
Ari
hash = str.split.inject({}) do |set, h|
set.merge( {h[0].chr => h[1..-1].to_i} )
end
Time taken: 23.02s
Ari’s comments: I knew everyone else would be strutting their one-liners, so I went for as much clarity as possible. I feel this code not only solves the problem, but also makes it very apparent what the desired solution is.
Benchmark code
require 'benchmark'
ITERATION = 400
def measure(&block)
ITERATION.times { block.call }
end
str = ''
5000.times { str << ('a'..'z').to_a[rand(26)] + (rand(90)+10).to_s + ' ' }
br = Benchmark.bmbm do |b|
b.report("Running #{ITERATION} times on string of length #{str.length}") do
measure {
#### Code to measure goes here
}
end
end
Conclusion
Lots of interesting styles here. We have two ways of creating hashes: with the Hash[ [key =>|, value]* method and with inject().
We’ve also seen various ways of breaking up a string: with scan(), with split(), and with regex arguments to a string’s [] method. While we cannot come to definite conclusions about performance (since the code that is being compared doesn’t differ just in one area), we can see a couple things going on:
- running Hash[] on an array as opposed to creating it directly via inject hurt Michel and Tim’s performances compared to Martin/Nicolas
- merge() seems to be significantly more costly than straight addition/replacement of hash keys. If you’re ever injecting into a hash it may be wiser (if your problem allows) to follow the
{ hash[key] = value; hash }format instead of merging.
We’ve picked Martin as the style winner for the elegance of his code which fit of the string parsing inside the inject block as opposed to the scan-based solutions which did some of it outside in the scan. The h[i[/^\D+/]] = i[/\d+$/].to_i portion was the clearest signal of intent for the contents of the hash.
Meanwhile, Nicolas is the performance winner by a mile, coming in more than 20% faster than the nearest competitor.
Congratulations to Martin (style) and Nicolas (speed) for being the winners of the inaugural CitrusByte Ruby TMTOWTDI. Stay tuned for the next episode!
EDIT
Additional comments by Nicolas:
Anyway, after thinking about it, even if I like inject a lot, there’s one big gain for the Hash[*...] as Michel posted. It
transmits intent better, in a way.
You look at that and the entire code sequence is a Hash construction, plain and simple, even if you ignore the subexpression in the middle, you know the outcome. And if you look one character into the brackets, then you know it’s constructing a hash from some transformation of an existing enumerable. Bah, just something that responds to #to_a.
Instead, using inject, the code flows (at least in my head) nicer as you don’t have to parse subexpressions, but it’s not immediately obvious what’s it about.
Hm, actually, depends on what intent is better to show. I mean, the objective is something like “to transform a string in a certain format into a more traversable structure” or “to create a hash out of a string in a certain format?”
That (not so) subtle difference might make me decide about which way is better, but then the description of the problem is not descriptive enough. Good, that means that the problem lies in the exercise’s designer. I can sleep well :P
(Probably there’s a third approach that to a degree involves a recursive lambda and someone flying over a desktop yelling “stand back! I know hash construction”, but my ninjitsu is not good enough for that yet)

{ 22 comments… read them below or add one }
The fastest approach I’ve found so far:
hash = Hash.new; str.scan(/(\w)(\d\d)/).each { |k, v| hash[k] = v.to_i }
This is what I’d do, takes less than 0.1 seconds using your benchmark code:
hash = {}; str.split.each {|str| hash.merge!({str[0, 1] => str[1, 2].to_i})}
Here’s a first attempt at a one-liner without looking at other answers:
Hash[*str.split(/\s/).map { |token| token.match(/(\D+)(\d+)/) }.inject([]) { |arr, match| arr << match[1] << match[2].to_i }]
Also, if you didn’t need the values to actually be integers, Michel’s solution could be condensed into:
Hash[*str.scan(/(\D+)(\d+)/).flatten]
Duh, the above example doesn’t account for the spaces, this one does:
Hash[*str.scan(/(\D+)(\d+)\s?/).flatten]
Actually seeing all the solutions, my regexp is horrible :P I should’ve used \w instead of \D. And besides that’s even faster. The benchmark in my computer for my original solution reports 7.08s, and for the same approach (str.scan, then inject) but using Tim’s regular expression times at 6.45s
Not that I care that much for speed over readability, but in this case the regular expression is actually clearer, and even faster to parse! :)
I can’t wait for episode II ^_^
Oh, and Michel’s fastest solution (first comment) times at 4.98s in my computer, so it’s still the fastest solution around.
Nicolas, are you just trying to show that your computer is faster than mine? =P
My new method:
hash = Hash.new; str.split.each {|str| hash[str[0, 1]] = str[1, 2].to_i}
Calling Hash.new directly didn’t really change anything, but replacing #merge! with a direct replace/add made a difference. However, the script went from 0.0018 (real) to 0.0016 (real), so not a big deal.
I think the main difference between my solution and all the others is the use of a regular expression. In this case, we know that the first character will always be the key, and the next two characters will always be the value. Because the pattern will never change, isn’t running the expression a bit too costly?
Michael and Luke, thanks for your thoughts!
Yeah the point of requiring integers was to make it a slightly more involving problem. Requiring integers would entail more variations of answers since you’d need a way of iterating for the to_i.
I generally tend to favor regexes (despite performance hits over straight-up string indexing) because most of the time they communicate intent more.
Another possibility would be something like this:
str.split.inject({}) {|h, s| h[s.slice!(0).chr] = s.to_i; h }
In 1.9 it would look a bit nicer:
str.split.inject({}) {|h, s| h[s.slice!(0)] = s.to_i; h }
(The latter is untested)
Here another possibility:
Hash[*str.split.inject([]) {|a, s| a << s.slice!(0).chr << s.to_i }]
Grr. Stupid me. slice can return strings for real slices. So that one would work too:
Hash[*str.split.inject([]) {|a, s| a << s.slice!(0, 1) << s.to_i }]
Here’s mine, I tested it on two separate machines and it’s about 1 second faster than Nicolas’s method.
Ugly one-liner form:
require ‘enumerator’; h = Hash.new; str.unpack(’aa2x’ * (str.length / 4)).eachslice(2) { |(k,v)| h[k] = v.toi }; h
Slightly better multi-line form:
require ‘enumerator’ h = Hash.new str.unpack(’aa2x’ * (str.length / 4)).eachslice(2) { |(k,v)| h[k] = v.toi } h
Here’s the averages I got from the first machine I tested it on:
Martin: 19.71 Nicolas: 14.09 Andrew: 13.26
And the second machine:
Martin: 7.62 Nicolas: 5.60 Andrew: 4.17
It’s probably cheating to require a separate module from stdlib, but you didn’t specify whether I could or not in the requirements :-) Personally, I had never used the Enumerable#each_slice method before.
A great post, I love these as its always a good way to pick up tips.
This is especially interesting looking into the dark art of creating hashes using blocks.
I suppose these things are system dependeent. I tried the supplied methods and all were clocking in around the 25-30 seconds mark (Yes, I have a slow machine).
Applying a little KISS to the idea I came up with the very simple:
output = Hash.new
inter = Array.new
inter = str.split
inter.each do |s|
output[s[0,1]] = s[1,3].to_i
end
Which clocked in around 12 secs on my machine. It seems that forcing the types might help the compiler instead of making it choose.
Interesting test though.
Here’s one more:
s.split.sort.inject({}) { |h, a| h.merge(a[/^[a-z]/] => a[/[0-9]+$/].to_i) }
I could be wrong, but I don’t think “a20″.toi results in 20. I believe, if the number is first, doing a toi on a string will return the number. But when the letter is first, that’s not the case, which means that there still needs to be some sort of parsing, rather than a v.to_i implementation, right?
Again, maybe I’m wrong about that.
Whoops, I hadn’t read all of the comments yet when I submitted my post. I thought Nicolás said that Martin’s was 4.98 seconds.
I went ahead tested some of the faster implementations to see who came out winner on my two systems.
The first system is a AMD Athlon XP 2400+ with 2.5GB of RAM running “Linux 2.6.24-17-generic i686″ (Ubuntu 8.04)
Andrew: 13.26
Armin #1: 11.45
Armin #2: 15.56
John: 8.81
Luke: 30.18
Nicolás: 14.09
Martin: 19.71
Michel: 10.33
Ryan: 52.45
The second system is an Intel Xeon 5130 @ 2.00GHz with 2GB of RAM running “FreeBSD 7.0-RELEASE amd64″
Andrew: 4.17
Armin #1: 4.98
Armin #2: 5.43
John: 3.86
Luke: 9.88
Nicolás: 5.60
Martin: 7.62
Michel: 4.01
Ryan: 21.33
For some reason, Armin’s first implementation is faster than mine on 32-bit Linux, while mine is faster than his on 64-bit FreeBSD.
John’s implementation still mops the floor with the rest of us.
My (without looking) solution was pretty similar to Nick’s although I didn’t assume the key would always be a single character.
str.split.inject({}) {|hash, item| item =~ /(\w+?)(\d+)/; hash[$1] = $2.to_i; hash}
Something strange I noticed was the parentheses around two of the block parameters in Nick’s solution. I removed them and to my surprise it had a drastic impact on performance.
With parentheses: 6.66 Without parentheses: 8.45
I got similar result with JRuby. Can anyone explain what the presence of parentheses in the argument list even does and how that might be impacting performance?
Great post! I look forward to more of these.
@David: The parentheses are there to unpack the second item into (k, v) instead of having to split it inside, I didn’t know it would give an improvement on performance.
@Andrew: Michel’s “quickest version” (the one posted in the first comment) run in 4.98 seconds in my machine, one of the new penryn MacBook Pros (2.4GHz Core 2 Duo, 2GB @667 stock)
h = Hash.new;str.each(’ ‘) { |s| h[s[0,1]] = s[1,3].to_i }
just wondering what about this?