Network Fragments, Analysis of E-Mail

One emerging form of structured text analysis is the study of large corpora of e-mail. See Social Network Fragments and InFlow and Email Datamining. Both projects create visualizations of networks much as Steve Ramsay does with StageGraph.

An interesting question for TAPoR is whether we can build an aggregator that assembles a corpus from e-mail in a form that other tools could use.

There would be two parts to this:

1. An aggregator tool that could gather e-mail (by accepting an uploaded file, traversing an archive, or listening to a discussion list) and encode it for further analysis.

2. Adapted versions of the TAPoRware tools that could work on an e-mail corpus. Some of the tools might include:

2.1 Frequency-sorted word lists, with the possibility of getting word lists for date ranges or authors.

2.2 Distribution graphs of patterns over time or authors.

2.3 Visualizations that might show a weighted centroid with words. Authors or dates might be the outer ring segments. (The percentage of words by an author would determine the size of that author's segment of the ring.)

2.4 Collocations for words and authors/dates sorted by Z-score.

2.5 Ability to search for phrases, words, collocations and patterns. Ability to restrict by codes.

2.6 Visualizations of networks.
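
To make parts 1 and 2.1 concrete, here is a minimal sketch in Python, using only the standard library. It is not an existing TAPoR or TAPoRware component. The names Record, encode_message and word_frequencies are hypothetical, as are the sample messages, and the sketch assumes each message is plain text (not multipart) with well-formed From and Date headers.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime
from typing import Iterable, List, Optional, Tuple


@dataclass
class Record:
    """One encoded message: the fields later tools would filter on."""
    author: str        # lower-cased address taken from the From header
    sent: datetime     # parsed Date header
    subject: str
    body: str


def encode_message(raw: str) -> Record:
    """Aggregator step: turn one raw RFC 5322 message into a Record.

    Assumes a plain-text, single-part message with From and Date headers.
    """
    msg = message_from_string(raw)
    _, address = parseaddr(msg.get("From", ""))
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""  # skip multipart bodies
    return Record(
        author=address.lower(),
        sent=parsedate_to_datetime(msg["Date"]),
        subject=msg.get("Subject", ""),
        body=body,
    )


def word_frequencies(
    corpus: Iterable[Record],
    author: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List[Tuple[str, int]]:
    """Tool 2.1: frequency-sorted word list, optionally restricted
    to one author and/or an inclusive range of calendar dates."""
    counts: Counter = Counter()
    for rec in corpus:
        day = rec.sent.date()
        if author is not None and rec.author != author:
            continue
        if start is not None and day < start:
            continue
        if end is not None and day > end:
            continue
        # Crude tokenizer: runs of letters and apostrophes, lower-cased.
        counts.update(re.findall(r"[a-z']+", rec.body.lower()))
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample messages, for illustration only.
    samples = [
        "From: Ann <ann@example.org>\nDate: Mon, 03 Mar 2003 10:00:00 +0000\n"
        "Subject: Graphs\n\nThe network graph shows the network.\n",
        "From: Bob <bob@example.org>\nDate: Tue, 01 Apr 2003 09:30:00 +0000\n"
        "Subject: Re: Graphs\n\nThe graph needs labels.\n",
    ]
    corpus = [encode_message(raw) for raw in samples]
    print(word_frequencies(corpus, author="ann@example.org"))
    print(word_frequencies(corpus, start=date(2003, 4, 1)))
```

Keeping the author and date as fields on every record is the design choice that matters here: the distribution graphs (2.2), collocations (2.4) and restricted searches (2.5) could all filter on the same two fields before counting.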