Speeding Up WordNet::Similarity

Posted 10 Jun 2010 in linux, text mining

One nifty Perl module is WordNet::Similarity, which provides a number of similarity measures between terms found in WordNet. It can either be used from the shell via the command similarity.pl or run as a server by starting similarity_server.pl. The latter has the advantage that WordNet is not loaded into memory each time a measurement is taken, which speeds up queries drastically.

Does it really?

Unfortunately, the current implementation of the server allows only one query per TCP connection before it closes the socket again. I also experienced an unexplained delay before a server process actually finishes, which becomes a significant bottleneck when performing a lot of queries.

Here I provide a tiny patch for the similarity_server.pl file that removes these limitations. With the patch applied, multiple queries can be made per TCP connection, and there is no delay between them. The small changes that I have made speed up the querying process by a factor of 10. Below you can find the little Ruby script that I used to measure the time needed for the original version of the server (running on port 30000) and the new version of the server (running on port 31134).

#!/usr/bin/ruby

require 'socket'

left_words = [ 'dinosaur', 'elephant', 'horse', 'zebra', 'lion',
  'tiger', 'dog', 'cat', 'mouse' ]
right_words = [ 'lemur', 'salamander', 'gecko', 'chameleon',
  'lizard', 'iguana' ]

# Original implementation
puts 'Original implementation'
puts '-----------------------'
original_start = Time.new
left_words.each { |left_word|
  right_words.each { |right_word|
    socket = TCPSocket.open('localhost', 30000)
    socket.puts("r #{left_word}#n#1 #{right_word}#n#1 lesk\r\n\r\n")
    response = socket.gets.chomp
    socket.close

    redo if response == 'busy'

    measure = response.split.last
    puts "#{left_word} compared to #{right_word}: #{measure}"
  }
}
original_stop = Time.new

# New implementation
puts ''
puts 'New implementation'
puts '------------------'
new_start = Time.new
socket = TCPSocket.open('localhost', 31134)
left_words.each { |left_word|
  right_words.each { |right_word|
    socket.puts("r #{left_word}#n#1 #{right_word}#n#1 lesk\r\n\r\n")
    response = socket.gets.chomp

    measure = response.split.last
    puts "#{left_word} compared to #{right_word}: #{measure}"
  }
}
socket.puts("e\r\n\r\n")
# Let the server close the socket,
# or otherwise the child process may loop forever.
# socket.close
new_stop = Time.new

puts ''
puts 'Time required'
puts '-------------'
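# Subtracting two Time objects yields seconds as a Float,
# which keeps sub-second differences visible.
puts 'Original implementation: ' <<
  format('%.2f', original_stop - original_start) <<
  's'
puts 'New implementation: ' <<
  format('%.2f', new_stop - new_start) <<
  's'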

The new implementation has an additional command called 'e' that tells the server to close the socket. The actual patch of similarity_server.pl can be found here: similarity_server.patch
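
For reuse outside the benchmark, the persistent-connection pattern could be wrapped in two small helpers. This is only a sketch: the names build_query and query_all are my own and not part of WordNet::Similarity, and it assumes, as the script above does, that the patched server answers each query with a single line whose last whitespace-separated token is the score.

```ruby
require 'socket'

# Hypothetical helper: formats one query in the wire format used by the
# benchmark script ("r <word>#n#1 <word>#n#1 <measure>" plus a blank line).
def build_query(left, right, measure = 'lesk')
  "r #{left}#n#1 #{right}#n#1 #{measure}\r\n\r\n"
end

# Hypothetical helper: sends every pair over ONE open connection and
# returns the scores as strings. `io` only needs puts and gets, so a
# TCPSocket connected to the patched server works.
def query_all(io, pairs, measure = 'lesk')
  scores = pairs.map do |left, right|
    io.puts(build_query(left, right, measure))
    # Assumption: one reply line per query, score is the last token.
    io.gets.chomp.split.last
  end
  # 'e' asks the patched server to close the socket on its side.
  io.puts("e\r\n\r\n")
  scores
end

# Example use against the patched server from this post:
#   socket = TCPSocket.open('localhost', 31134)
#   p query_all(socket, [%w[dog lizard], %w[cat gecko]])
```

As in the benchmark, the client does not call socket.close itself after sending 'e' and leaves closing to the server.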

