Here is an interview with Tal Rotbart, founder and CTO of SpringSense. The SpringSense API has been covered by us before here. The API recognizes the nouns in a body of text and can be used to submit sentences or longer text for meaning detection and disambiguation.
PW- Describe the SpringSense Meaning Recognition API. How is it used by potential customers and developers?
Tal- The easiest way to think of the Meaning Recognition API is “text in, disambiguated text out”.
A classic example we use: the text “cat vet” would be disambiguated as “cat.n.01 veterinarian.n.01”, while the text “military vet” would be recognized as “military.n.01 veteran.n.02”. The noun identifiers are based on the WordNet 3.1 ontology.
This is a simplification: the output also includes other probable disambiguations (along with their probabilities) and the WordNet definitions (glosses) for the nouns. Additionally, all words in the text are tagged with their part-of-speech identifier, though only nouns, proper nouns and noun-like verbs are disambiguated.
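To illustrate what consuming such output might look like, here is a small sketch of a client picking the most probable WordNet sense for each noun. The response shape below (field names, probabilities, glosses) is an assumption for illustration only, not the documented SpringSense response format:

```python
# Hypothetical response shape: each term carries a list of candidate
# WordNet 3.1 senses with probabilities and glosses, as described above.
response = {
    "terms": [
        {"word": "cat", "pos": "NN", "senses": [
            {"id": "cat.n.01", "prob": 0.92,
             "gloss": "feline mammal usually having thick soft fur"},
            {"id": "cat.n.03", "prob": 0.08,
             "gloss": "a spiteful woman gossip"},
        ]},
        {"word": "vet", "pos": "NN", "senses": [
            {"id": "veterinarian.n.01", "prob": 0.88,
             "gloss": "a doctor who practices veterinary medicine"},
            {"id": "veteran.n.02", "prob": 0.12,
             "gloss": "a person who has served in the armed forces"},
        ]},
    ]
}

def best_senses(resp):
    """Return the highest-probability sense id for each disambiguated term."""
    return [max(term["senses"], key=lambda s: s["prob"])["id"]
            for term in resp["terms"] if term.get("senses")]

print(best_senses(response))  # ['cat.n.01', 'veterinarian.n.01']
```

A client that only needs the single best reading would use something like this; one that re-ranks search results might keep the full probability distribution instead.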
The most common use of the API to date has been as part of enterprise search solutions. The API was used to annotate indexed documents during ingestion, and to augment queries on retrieval.
This allowed our customers to boost retrieval relevance for search applications without requiring application specific tuning. This is especially powerful for organizations with documents that span across multiple domains. One of our most avid customers is a major Australian university and our API improved their online course search significantly. This in turn increased their conversion rates and improved their bottom line.
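The ingestion-and-retrieval pattern described above can be sketched roughly as follows. The `disambiguate` function here is a hard-coded stand-in for an API call, and the token format and boolean query syntax are assumptions for illustration, not the actual SpringSense or search-engine interface:

```python
def disambiguate(text):
    """Stand-in for a disambiguation API call.
    Returns (word, sense_id) pairs; hard-coded purely for illustration."""
    lookup = {"military vet": [("military", "military.n.01"),
                               ("vet", "veteran.n.02")]}
    return lookup.get(text, [(w, None) for w in text.split()])

def annotate_for_index(doc_text):
    """Ingestion time: append sense ids as extra searchable tokens."""
    pairs = disambiguate(doc_text)
    tokens = [word for word, _ in pairs]
    senses = [sense for _, sense in pairs if sense]
    return tokens + senses

def augment_query(query_text):
    """Retrieval time: OR each word with its sense id, so a search for
    'military vet' favors documents annotated with veteran.n.02
    rather than veterinarian.n.01."""
    parts = []
    for word, sense in disambiguate(query_text):
        parts.append(f"({word} OR {sense})" if sense else word)
    return " ".join(parts)

print(augment_query("military vet"))
# (military OR military.n.01) (vet OR veteran.n.02)
```

The point of the pattern is that relevance boosting comes from matching on sense identifiers shared between index and query, with no application-specific tuning of the search engine itself.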
PW- You claimed in a recent release to offer the world’s most accurate Word Sense Disambiguation API. What are some of the ways you fine-tune your NLP libraries?
Tal- Just to clarify, our claim is to the title of the world’s most accurate noun-sense disambiguation API – this is an important distinction.
For the application for which we originally devised the API, namely search, focusing on nouns made perfect sense: when you are retrieving a document, you care a lot more about whether it is about a veteran or a veterinarian than about whether that veteran is loyal or grumpy.
The primary reason we’ve managed to achieve the accuracy and speed that we have with our API is that we’ve taken a fundamentally different approach to Word Sense Disambiguation (WSD) from the one you would usually find in NLP algorithms. Focusing on nouns allowed us to tread this path.
Instead of taking the machine-learning path, we’ve adapted our patent-pending data-mining algorithm to mine the English language in the form of WordNet (plus some additional ‘secret sauce’).
One of the major advantages of our approach over traditional NLP is that we achieve our results without compromising performance.
We understood right from the beginning that without sufficient performance, even the best WSD would not be useful. Our API’s speed opens up our algorithm to some very interesting high-volume applications.
As our core algorithm is not based on machine-learning techniques, most of the challenges we faced involved mining relevant semantic data from WordNet and filtering out the noise, as well as adapting the data so it would fit within the constraints of our algorithm.
Focusing on improving the signal-to-noise ratio of the input data, rather than relying on the usual machine-learning approach of fine-tuning to a particular corpus, resulted in an algorithm that is more robust and isn’t corpus-specific.
We’re proud of the fact that our API works equally well on any source of contemporary English text. This is borne out by equivalent scores on benchmarks that use widely different types of text.
As a side note, we do use some NLP technology in the form of a part-of-speech tagger (MorphAdorner), which did require some fine-tuning, especially since it was originally trained on early-20th-century literature. Strangely enough, it works significantly better than some other taggers that were trained on more contemporary texts.
Still, use of our tagger is optional: if you’d rather use a different tagger, our API allows you to provide input that is already annotated with part-of-speech tags.
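For a sense of what supplying pre-tagged input could involve, here is a minimal sketch that serializes the output of an external POS tagger. The `word/TAG` convention (with Penn Treebank-style tags) is an assumed wire format for illustration; the interview does not specify SpringSense’s actual pre-annotated input format:

```python
# Assumed 'word/TAG' convention; the real SpringSense input format
# may differ. The tags here follow the common Penn Treebank tag set.
def to_pretagged(tagged_tokens):
    """Serialize (word, pos_tag) pairs from your own tagger
    into a single pre-annotated string."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged_tokens)

# Example output of an external POS tagger of your choice:
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
print(to_pretagged(tagged))  # The/DT cat/NN sat/VBD
```

Delegating tagging this way lets an application keep a tagger already tuned to its domain while still using the API for the disambiguation step.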
PW- In your opinion, how important is open-source sharing to improving libraries and algorithms and to advancing computer science?
Tal- Our entire team at SpringSense are firm believers in open source. Open-source sharing is critical to improving both libraries and algorithms. In computer science especially, we see open source as an extension of the peer-review system, which we also believe should be free and open.
Along our journey to developing this technology we’ve used, and contributed back to, many open-source projects – some in the NLP space (MorphAdorner, JAWS) and some around the operational side of running the clusters we require for our data-mining (StarCluster, Chef, Ironfan, Politburo).
As it stands, we would love to be able to release all our technology as open source, but the business models available for this type of technology preclude it. Especially tough was the decision to patent our algorithm, as we’re not believers in software patents. It is unfortunate that under the current patent system one is forced to arm oneself with defensive patents in order to defend against patent trolls.
PW- What are some of the ways you create awareness and excitement among developers for your API? What is your pricing strategy?
Tal- We’re really more technologists than marketers here at SpringSense, so we have been trying to raise awareness by helping developers build cool apps using our API, hoping that their success in turn generates excitement for our work. We provide a very generous free tier to our API that we hope allows fledgling applications to get their feet wet.
We try to encourage use of our APIs at hackathons, such as the one that Mashape, our API marketplace, is sponsoring at APIDays Mediterranea ( http://mediterranea.apidays.io/ ). We’ve also sponsored some hackathons closer to home here in Australia that resulted in some very interesting applications.
Additionally, we contributed free use of our API towards Random Hacks of Kindness events through our parent company DiUS. If you are building an application for a not-for-profit world-improving organization that you think would benefit from our API, talk to us, we’d be keen to contribute free use of the API.
PW- Describe the landscape of Text Mining and NLP APIs that you currently track, monitor or benchmark. Where is your strategic direction headed? What are your plans for 2013?
Tal- NLP is a fast moving landscape and we certainly try to keep abreast of the latest and greatest. There’s almost too much happening to keep on top of it!
On the commercial side of things we try to track what kind of interesting applications are making use of NLP technologies, especially ones that may be further augmented by using our own.
We’re quite keen to find a team doing sentiment analysis that would like to collaborate: we believe that by preprocessing their training corpus with our API they could achieve greater accuracy across a wider range of inputs.
On the academic end, we’re constantly tracking the latest research to see whether some of the results can be applied to further improve the speed and accuracy of our API. Because our approach is somewhat tangential to the prevalent NLP methods, some of the improvements come from unexpected places. We’re also looking forward to seeing some of the interesting results and benchmark contenders to come out of SemEval 2013 in Atlanta, Georgia next month.
Another point of interest for us is the impending open-sourcing of Numenta’s cortical learning algorithm — we are keen to see if that would lead to any interesting NLP applications.
Our strategic direction for 2013 is all about helping developers discover new uses for our API and hoping to drive its use.
Meanwhile, in our R&D lab we’re exploring ways of increasing the accuracy and speed of our API even further. I wouldn’t want to reveal too much; let’s just say that we’d like to mine an even richer and more current representation of the English language than WordNet.
Tal has over fifteen years’ experience in software development and consulting, having worked with large firms such as Mercury Interactive (now part of HP), Sensis, Toyota, Westfield, US Airways and Motorola. He can be reached on Twitter: @rotbart
Accurate noun-sense disambiguation, anyone? Better interpretation of search results? It’s just another SpringSense API call away!