TAPoRware Features

We are releasing version 1.0 of the TAPoRware Tools. (You can get version 1.0 now, but there are some loose ends to clean up.) That got me thinking about the next version. Stan Ruecker and Zachary Devereux of the University of Alberta gave a paper at the Face of Text, "Scraping Google and Blogstreet for Just-in-Time Text Analysis," which showed the potential of certain tools and included a list of features they would like. Stan kindly sent me the list so I could weave it into my own.

Here is Stan and Zack’s list for TAPoRware plus. The narrative is my interpretation.
Feedback
Better feedback for errors and perhaps a way for users to provide feedback to the developers.
Interface
We need interfaces that make it easy to run tools on texts and then move results around. This is what the portal should do, but we can also imagine interfaces that pipeline things like Eye-ConTact.
Stemming
The ability, for a given language, to search for the forms of a word. This could be a tool that would output a list that could be fed back into other TAPoRware tools, but it is probably easier to implement in the portal. TACT had an interesting variant called SIMIL that used an algorithm that would find similar words regardless of linguistic relations.
Stop lists for function words
We have added this to the new version, but we may not have it right.
Fixed phrase location
This is presumably an expansion of what the NYU team did.
Extraction of proper nouns
Yes! We have already started to think about how to extract names, dates, and places. Proper nouns should be part of this.
Multiple collocate keywords
I assume this means more than two words collocating, as in a list of five words that all occur within a span such as a paragraph. I wonder if this is easy to do as an extension of the collocate tool?
Automatic theme identification (Dominic)
Yes, we are working on summarization and theme analysis, but we need to learn more about the algorithms out there.
Multiple files and larger files
Yes, multiple files will be handled by an aggregator and the portal. Larger files may be more a matter of adding processing power or of providing a command-line version for people to run on their own box.
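
On the multiple collocate keywords question above, here is a minimal sketch of one way it might work, assuming the span is a paragraph (text separated by blank lines) and that matching is on whole lowercase word forms. This is not TAPoRware code; the function name multi_collocates and the sample text are made up for illustration.

```python
import re

def multi_collocates(text, keywords, span_sep=r"\n\s*\n"):
    """Return (index, paragraph) pairs for paragraphs containing every keyword.

    Hypothetical helper: spans are paragraphs split on blank lines,
    and a keyword matches only as a whole lowercase word.
    """
    wanted = {k.lower() for k in keywords}
    hits = []
    for i, para in enumerate(re.split(span_sep, text.strip())):
        # Collect the distinct word forms that occur in this span.
        words = set(re.findall(r"[a-z']+", para.lower()))
        # Keep the span only if all keywords are present in it.
        if wanted <= words:
            hits.append((i, para))
    return hits

sample = (
    "The whale rose from the sea beside the ship.\n\n"
    "The captain watched the sea in silence.\n\n"
    "A ship, a whale, and the captain met upon the sea."
)

matches = multi_collocates(sample, ["whale", "sea", "ship"])
for index, paragraph in matches:
    print(index, paragraph)
```

Because each span is reduced to a set of word forms, adding a third or fifth keyword costs little more than the two-word case, which suggests this could be a modest extension of a collocate tool. Combining it with stemming would need the keyword list expanded to all forms first.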
