Text Analysis Spiders

One of the most exciting directions in text analysis is the adaptation of spiders, trackers, and aggregators so that they can gather just-in-time texts (jitexts) for further analysis. This could open up text analysis to cultural studies researchers and make it a playful way to comb the internet. Most of the tools out there start with Google as their spider – do we need to create our own index so as to avoid depending on Google?

Here is the framework of an idea:

THE IDEA

A number of us are working on just-in-time aggregators that can gather text corpora that suit research. The idea would be to design a system that would do the following:

1. Spider

A spider that would build a searchable index of online knowledge around a given domain. The spider would not try to index the whole web, but would be focused and trained to gather web pages around digital humanities and electronic texts. Where possible this spider would try to infer the metadata needed for DUCTapor so that it could be searched within DUCTapor and hence through the TAPoR portal.

It might be possible to try to train the spider to gather the full text of electronic texts of interest to humanists too, though I’m not sure how that might work. The idea here is to build a mini Internet Archive of e-texts that are public.

It might be possible to run some summarizers over the spidered data to capture a higher-level view of clusters of materials, one that could be easily edited into a DUCTapor entry. Some of the types of entities that we would try to index could be:

  1. People – who they are connected to, what projects they work on, what sites they are associated with
  2. Projects – what they are about, who is on the projects, what materials they make public
  3. Collections and E-Texts – what is out there and in what form, who is connected to these original texts and what their status is
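As a rough sketch of what such a focused spider might look like (in Python, with hypothetical names – a real spider would also need robots.txt handling, politeness delays, and proper metadata inference for DUCTapor), the core idea is a breadth-first crawl that only indexes and expands pages matching the domain focus:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Illustrative focus vocabulary; a trained spider would use a richer model.
FOCUS_TERMS = {"digital humanities", "electronic text", "e-text", "corpus"}

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def is_relevant(text):
    """Crude focus test: does the page mention any focus term?"""
    lowered = text.lower()
    return any(term in lowered for term in FOCUS_TERMS)

def crawl(seeds, fetch, max_pages=50):
    """Breadth-first focused crawl.

    `fetch(url)` returns the page HTML or None; injecting it keeps the
    sketch testable without network access.  Returns a url -> html index.
    """
    queue = deque(seeds)
    seen = set(seeds)
    index = {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None or not is_relevant(html):
            continue  # off-topic pages are skipped and not expanded
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index
```

The key design choice is that irrelevant pages are neither indexed nor followed, which is what keeps the spider focused on a domain rather than wandering off into the whole web.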

2. Aggregator

An aggregator which, given a keyword or the appropriate metadata criteria, would generate a just-in-time text from the spidered archive. This would provide us with an alternative to an aggregator that depends on Google or another commercial index. Such a just-in-time text could then be passed through the portal to other tools for analysis. The aggregator might also be something that could be restricted to electronic texts rather than pages about a keyword, if we were able to figure out a way of identifying texts *by* Plato, for example, rather than *about* Plato.

An aggregator might also be able to work from a set of starting points that come from DUCTapor or from the user. It could then follow rules to grab all the stuff within so many links.
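A minimal sketch of the aggregation step (assuming the archive records carry a hypothetical "author" metadata field, which is exactly the hard part the *by* Plato vs. *about* Plato question points at) might look like:

```python
def build_jitext(archive, keyword, author=None):
    """Assemble a just-in-time text from a spidered archive.

    `archive` maps url -> {"text": str, "author": str or None}.
    With `author` set, select texts *by* that author (metadata match);
    otherwise fall back to texts *about* the keyword (full-text match).
    Returns one concatenated text with URL headers, ready to pass on
    to other analysis tools.
    """
    parts = []
    for url, record in sorted(archive.items()):
        if author is not None:
            if record.get("author") != author:
                continue
        elif keyword.lower() not in record["text"].lower():
            continue
        parts.append(f"== {url} ==\n{record['text']}")
    return "\n\n".join(parts)
```

The point of the sketch is that the *by*/*about* distinction reduces to whether you filter on metadata or on full text – which is why inferring metadata at spidering time matters so much.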

3. Tracker

A tracker that can be given a set of keywords, metadata criteria, or starting points and asked to wander the web, reporting back at regular (weekly or monthly) intervals with a summary of what has been found, thereby developing a diachronic picture of the pages that match. It could save the materials in a format that could be passed as a Just Over Time text to other TAPoR tools. TAPoR users could launch these agent trackers and then manipulate the incoming results over time.
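The diachronic part of the tracker is essentially snapshot-and-diff. A sketch of that core (names are hypothetical; the real tracker would also schedule the crawls and summarize the changes for the user):

```python
import hashlib

def snapshot(archive):
    """Map each URL in a crawled archive (url -> text) to a content fingerprint."""
    return {url: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for url, text in archive.items()}

def diff_report(previous, current):
    """Compare two snapshots and report what appeared, changed, or vanished."""
    new = sorted(set(current) - set(previous))
    gone = sorted(set(previous) - set(current))
    changed = sorted(url for url in set(previous) & set(current)
                     if previous[url] != current[url])
    return {"new": new, "changed": changed, "gone": gone}
```

Run at weekly or monthly intervals, the sequence of reports is the diachronic picture; the saved snapshots are what could be packaged up and passed along to other TAPoR tools.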
