Text Vectors – Theoreti.ca

Most text analysis techniques present a synchronic view of the text. For example, a list of word frequencies treats the text as a whole. How can we look at change across a text? How can we quantify a text as it progresses, whether in writing or playing? Could we anticipate the sorts of words likely to be used or summarize those used before?

Here is a technique adapted from a hypertext project I saw demonstrated by Apple years ago when Brenda Laurel worked for them. I am not sure of their details, so this is how I would do it.

What they did was create a vector for each node in the hypertext. The vector was the position of the node in an n-dimensional space where each dimension is a word or pattern. Thus a node that has two instances of a pattern that you are tracking would be located at “2” on the axis for that dimension. If you track 200 content words you have a 200-dimensional space, and each of your nodes is located somewhere in that space. (This is similar to what John Bradley and I did for the Hume visualization project Simweb; see the SIMWeb Query Form or, for an explanation, Help with Correspondence Analysis.)
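To make the idea of a node vector concrete, here is a minimal sketch in Python. It is my own illustration, not the Apple project's code: the function name `node_vector`, the simple lowercase tokenizer, and the three tracked words are all assumptions, and the sample sentence is invented.

```python
import re
from collections import Counter

def node_vector(text, tracked_words):
    """Return one count per tracked word, in the order given.

    Each tracked word is one dimension; the count is the node's
    coordinate on that dimension.
    """
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in tracked_words]

# Hypothetical tracked words and an invented node text.
tracked = ["reason", "passion", "custom"]
node_text = "Reason guides habit, but reason yields to custom."
vector = node_vector(node_text, tracked)
print(vector)
```

With 200 tracked words instead of three, the same function would return a position in a 200-dimensional space. Note that this sketch matches exact word forms only, so "passions" would not count toward "passion".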

What is interesting is that we can then follow a user (ideal or actual) from node to node and build a trajectory through the n-dimensional space that allows us to calculate where they might be going. If a user goes through 5 nodes and now has 3 possible nodes ahead, we could calculate which of the 3 is closest to the word frequency trajectory of his/her choices.
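One way this calculation might go, sketched in Python under a strong assumption: that the trajectory can be extended in a straight line by adding the average step between visited nodes to the last node. I do not know how the original project did it. The names `predict_next` and `closest_candidate` and the two-dimensional sample data are hypothetical.

```python
import math

def predict_next(path):
    """Extend the path one step by adding the mean displacement.

    The mean of consecutive steps equals (last - first) / number of steps.
    Assumes the path holds at least two vectors.
    """
    steps = len(path) - 1
    return [last + (last - first) / steps
            for first, last in zip(path[0], path[-1])]

def closest_candidate(path, candidates):
    """Return the name of the candidate nearest the predicted position."""
    target = predict_next(path)
    return min(candidates, key=lambda name: math.dist(target, candidates[name]))

# Five visited nodes in an invented 2-dimensional word space.
path = [[0, 4], [1, 3], [2, 3], [3, 2], [4, 1]]
# Three possible nodes ahead.
candidates = {"A": [5, 0], "B": [0, 5], "C": [4, 4]}

print(predict_next(path))
print(closest_candidate(path, candidates))
```

Euclidean distance is used here for simplicity; another measure, such as cosine similarity, might suit word counts better.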

Likewise, we could calculate an ideal word frequency for where he/she came from (which might be different from the actual node they came from).
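A sketch of one possible reading of this, in Python: run the same straight-line assumption backward from the later nodes and compare the result with the node the user actually started from. The function name `ideal_origin` and the sample vectors are hypothetical, and the straight-line extrapolation is my assumption rather than a known detail of the original project.

```python
import math

def ideal_origin(path):
    """Extrapolate one step backward from the first vector on the path.

    Subtracts the mean displacement between consecutive vectors.
    Assumes the path holds at least two vectors.
    """
    steps = len(path) - 1
    return [first - (last - first) / steps
            for first, last in zip(path[0], path[-1])]

# Invented data: the node the user really came from, then the nodes visited after it.
actual_origin = [0, 4]
later_nodes = [[1, 3], [2, 3], [3, 2], [4, 1]]

ideal = ideal_origin(later_nodes)
gap = math.dist(ideal, actual_origin)
print(ideal, gap)
```

A small gap would suggest that the actual starting node fits the later trajectory; a large gap would suggest the user changed direction early on.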

While this makes sense in a hypertext, there is a theoretical problem in a linear text. The problem is that the sequence of words, paragraphs, and chapters is not necessarily the sequence of reading. We fall into the trap, when developing distribution graphs, of equating the progression of the string of characters with an ideal progression of reading. What would actually be more interesting with a traditional text would be to see how each chapter subverts the expected trajectory based on the chapters before.
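As a rough sketch of how such subversion might be measured, the Python below scores each chapter by its distance from the position expected from the chapters before it. It assumes each chapter has already been reduced to a word-count vector and that the expected position is a straight-line extension of the earlier chapters. The name `subversion_scores` and the sample vectors are hypothetical.

```python
import math

def subversion_scores(chapters):
    """Score each chapter from the third onward.

    The score is the Euclidean distance between the chapter's vector and
    the position expected by extending the earlier chapters one mean step.
    """
    scores = []
    for i in range(2, len(chapters)):
        earlier = chapters[:i]
        steps = len(earlier) - 1
        expected = [last + (last - first) / steps
                    for first, last in zip(earlier[0], earlier[-1])]
        scores.append(math.dist(expected, chapters[i]))
    return scores

# Invented vectors for four chapters in a 2-dimensional word space.
chapters = [[0, 0], [1, 1], [2, 2], [0, 6]]
scores = subversion_scores(chapters)
print(scores)
```

In this invented data the third chapter lands exactly where the first two point, while the fourth departs sharply from the expected trajectory and so gets a high score.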