Archive for the ‘text mining’ Category

Call for Papers: PLoS Text Mining Collection

Posted 21 May 2012 — by caseybergman
Category text mining

[For some background on this initiative, please see this blog post. -CMB]

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All manuscripts submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven by community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research has largely not yet been fulfilled, primarily because the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides programmatic access to PLoS journal content.

As part of this Text Mining Collection, we are issuing a call for high-quality submissions that advance the field of text-mining research, including:

  • New methods for the retrieval or extraction of published scientific facts
  • Large-scale analysis of data extracted from the scientific literature
  • New interfaces for accessing the scientific literature
  • Semantic enrichment of scientific articles
  • Linking the literature to scientific databases
  • Application of text mining to database curation
  • Approaches for integrating text mining into workflows
  • Resources (ontologies, corpora) to improve text mining research

Please note that all manuscripts submitted before October 30th, 2012 will be considered for the launch of the collection (expected early 2013); submissions received after this date will still be considered for the collection, but may not appear in the collection at launch.

Submission Guidelines
If you wish to submit your research to the PLoS Text Mining Collection, please consider the following when preparing your manuscript:

All articles must adhere to the submission guidelines of the PLoS journal to which you submit.
Standard PLoS policies and relevant publication fees apply to all submissions.
Submission to any PLoS journal as part of the Text Mining Collection does not guarantee publication.

When you are ready to submit your manuscript to the collection, please log in to the relevant PLoS manuscript submission system and mention the Collection’s name in your cover letter. This will ensure that the staff is aware of your submission to the Collection. The submission systems can be found on the individual journal websites.

Please contact Samuel Moore if you would like further information about how to submit your research to the PLoS Text Mining Collection.

Casey Bergman (University of Manchester)
Lawrence Hunter (University of Colorado-Denver)
Andrey Rzhetsky (University of Chicago)

Cross posted at the PLoS Blog

Speeding-Up WordNet::Similarity

Posted 10 Jun 2010 — by caseybergman
Category linux, text mining

One nifty Perl module is WordNet::Similarity, which provides a number of similarity measures between terms found in WordNet. WordNet::Similarity can either be used from the shell via its command-line utility, or alternatively it can be run as a server by starting the server script that ships with the module. The latter has the advantage that WordNet will not be loaded into memory each time a measurement is taken, which speeds up queries drastically.

Does it really?

Unfortunately, the current implementation of the server allows only one query per opened TCP connection before the socket is closed again. I also experienced an unexplained grace period before a server process actually finishes, which becomes a significant bottleneck when performing a lot of queries.

I am providing here a tiny patch for the server script that removes these limitations. With the patch applied, multiple queries can be made per TCP connection and there is no delay between them. The small changes I have made speed up the querying process by a factor of 10. Below you can find the little Ruby script that I used to measure the time needed for the original version of the server (running on port 30000) and the patched version (running on port 31134).


require 'socket'

left_words = [ 'dinosaur', 'elephant', 'horse', 'zebra', 'lion',
  'tiger', 'dog', 'cat', 'mouse' ]
right_words = [ 'lemur', 'salamander', 'gecko', 'chameleon',
  'lizard', 'iguana' ]

# Original implementation: one TCP connection per query
puts 'Original implementation'
puts '-----------------------'
original_start =
left_words.each { |left_word|
  right_words.each { |right_word|
    socket ='localhost', 30000)
    socket.puts("r #{left_word}#n#1 #{right_word}#n#1 lesk\r\n\r\n")
    response = socket.gets.chomp

    # The server answers 'busy' when it cannot serve us; retry.
    redo if response == 'busy'

    measure = response.split.last
    puts "#{left_word} compared to #{right_word}: #{measure}"
  }
}
original_stop =

# New implementation: a single TCP connection for all queries
puts ''
puts 'New implementation'
puts '------------------'
new_start =
socket ='localhost', 31134)
left_words.each { |left_word|
  right_words.each { |right_word|
    socket.puts("r #{left_word}#n#1 #{right_word}#n#1 lesk\r\n\r\n")
    response = socket.gets.chomp

    measure = response.split.last
    puts "#{left_word} compared to #{right_word}: #{measure}"
  }
}
# Let the server close the socket,
# or otherwise the child process may loop forever.
# socket.close
new_stop =

puts ''
puts 'Time required'
puts '-------------'
puts 'Original implementation: ' <<
  (original_stop.to_i - original_start.to_i).to_s << ' seconds'
puts 'New implementation: ' <<
  (new_stop.to_i - new_start.to_i).to_s << ' seconds'

The new implementation has an additional command called ‘e’ that can be used to close the socket to the server. The actual patch can be found here: similarity_server.patch
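For reference, the wire format used by the benchmark script can be captured in a small helper. The `word#n#1` notation (noun, first sense), the `r` command, and the ‘e’ close command are taken from this post; treating part of speech and sense number as parameters, and framing the ‘e’ request the same way as the ‘r’ request, are my own assumptions:

```ruby
# Build a relatedness query in the format the benchmark script sends;
# CLOSE_REQUEST (only understood by the patched server) asks the server
# to close the connection.
def relatedness_query(left, right, measure = 'lesk', pos = 'n', sense = 1)
  "r #{left}##{pos}##{sense} #{right}##{pos}##{sense} #{measure}\r\n\r\n"
end

CLOSE_REQUEST = "e\r\n\r\n"
```

With this, `socket.puts(relatedness_query('dog', 'cat'))` produces the same request as the script above.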

The “real” impact factor of Nucleic Acids Research is ~5.6

Posted 27 Apr 2010 — by maxhaussler
Category scientometrics, text mining

Nucleic Acids Research is a respected journal, publishing articles about, for example, restriction enzymes and DNA analysis. Twice a year it has a “special issue” with updates on databases and bioinformatics tools on the internet. These short “method papers” usually just summarize what has been added to the database or tool and rarely report research results directly. But they do attract a lot of citations: people who use a certain website or software tool are expected to cite the corresponding method article in Nucleic Acids Research.

Is this practice increasing the impact factor of this journal? Certainly, but by how much? It turns out that answering this question takes only 45 minutes of web searching and a one-line program (or an Excel formula).

The 2008 impact factor is calculated from articles published in 2006 and 2007. So I downloaded the citation data from Scopus and, by dividing the number of citations (14957) by the number of articles (2260), arrived at an impact factor of 6.61 (the official impact factor is 6.87; the difference is probably because my list copied from Scopus includes some articles that ISI Thomson considers “not citable”, or because Scopus has less data than Thomson). The 444 articles in special issues attracted 4750 citations. If NAR did not have special issues, its impact factor would therefore be around 5.6 (a bit higher, perhaps 5.8, due to the non-citable issue).
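The arithmetic above can be reproduced in a few lines of Ruby; the numbers are the Scopus figures quoted in this post, nothing else is assumed:

```ruby
# Impact factor with and without the special issues, using the
# Scopus numbers for NAR 2006-2007 quoted above.
citations = 14957         # citations to all 2006-2007 articles
articles  = 2260          # all 2006-2007 articles
special_citations = 4750  # citations to the 444 special-issue articles
special_articles  = 444

impact_factor = citations.to_f / articles
adjusted = (citations - special_citations).to_f /
           (articles - special_articles)

printf("with special issues:    %.2f\n", impact_factor)  # 6.62
printf("without special issues: %.2f\n", adjusted)       # 5.62
```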

The data from Scopus therefore gives everyone the possibility to quantify how much methods, reviews, or original research determine the impact factor of a journal.

How to redo this: go to Scopus, search for “srctitle(nucleic acids research)”, select “Nucleic Acids Research” and Year: 2006, click “Limit”, then “Citation tracker”, and export as a text file. Repeat the same thing for 2007. Convert the text files with Excel to tab-separated files and run the following one-liner on them:

cat 2006.txt 2007.txt | grep -v rratum | gawk 'BEGIN {FS="\t";}
/^200[0-9]/ {articles+=1; cites+=$11; if ($7=="") { webArticles+=1; webCites+=$11;}}
END {print "articles:"articles; print "citations:"cites; print "impact factor:" (cites/articles);
print "web/db articles:"webArticles; print "web/db article citations "webCites;
print "impact factor without web/db issue:" ((cites - webCites)/(articles- webArticles));}'
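For readers more comfortable with Ruby than awk, here is the one-liner rewritten as a function. The column numbers (an empty 7th column marking web/database articles, the 11th column holding the citation count, both 1-based) are taken directly from the awk script; everything else about the Scopus export layout is assumed:

```ruby
# Port of the awk one-liner: tally citations per article, separating
# web/database articles (empty 7th column) from the rest.
def impact_factors(lines)
  articles = cites = web_articles = web_cites = 0
  lines.each do |line|
    next if line.include?('rratum')   # skip errata, like grep -v rratum
    next unless line =~ /^200[0-9]/   # article rows start with the year
    fields = line.chomp.split("\t", -1)
    articles += 1
    cites += fields[10].to_i
    if fields[6].nil? || fields[6].empty?
      web_articles += 1
      web_cites += fields[10].to_i
    end
  end
  { with_special: cites.to_f / articles,
    without_special: (cites - web_cites).to_f / (articles - web_articles) }
end
```

Run it on the same exports, e.g. `impact_factors('2006.txt') +'2007.txt'))`.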

Note: a previous version of this post estimated the impact factor to be 4.5. This was wrong, as I forgot to remove the number of articles in the special issues from the total number of articles. I am very sorry for this bug.

The namazu fulltext indexer

Posted 10 Nov 2009 — by maxhaussler
Category text mining

There are certainly Lucene and the MySQL fulltext indexing facilities, but when you just want a simple and quick fulltext indexer on the command line for a bunch of files, plus a CGI script to search a bunch of webpages or HTML files, Namazu is a good choice: it is dead simple to install and comes with a good tutorial and very short documentation. Just what I expected. My congratulations to the Japanese programmers for such readable English documentation. You don’t need a book, as with Lucene.

Installation: I untarred the file. As I am not root on our cluster, just a regular user, I had to create ~/local, cd into namazu/File-MMagic, and run

perl Makefile.PL LIB=$HOME/local/lib INSTALLMAN3DIR=$HOME/local/man
make install

Then I went back into the namazu directory and ran

./configure --prefix=$HOME/local --with-pmdir=$HOME/local/lib
make install

Then created the index of my files (took a couple of hours):

~/local/bin/mknmz -O scratch/namazu-index/pmc/ text2genome/usr/fulltext/pmc/articles/

In my case (the fulltext of PubMed Central, 150k documents), this also showed the number of keywords, which was surprisingly high: eight million.

Date:                Mon Nov  9 23:37:20 2009
Added Documents: 56,868
Size (bytes): 1,291,360,854
Total Documents: 56,868
Added Keywords: 7,958,364
Total Keywords: 7,958,364
Time (sec): 24,515
File/Sec: 2.32
System: linux
Perl: 5.008008
Namazu: 2.0.20

I created ~/.namazurc and added a single line:

Index ~/scratch/namazu-index/pmc/

Now searching is as simple as

namazu --all RT-PCR
namazu --count RT-PCR

and takes mere milliseconds instead of hours…