Tuesday, May 15, 2007

Latent Semantic Indexing and the order of things

OK, Latent Semantic Indexing is a technique for information retrieval where you build a high-dimensional representation of a set of documents (or parts of documents), and then you can measure distances between the documents projected into that space.
I'm working with this for two reasons:
  1. There is a project that is actually paying my scholarship, and this project promises to use the latest text mining techniques.
  2. Some evidence suggests that "common sense" and "intuition" can actually be captured by this technique, precisely the kind of thing that I think better represents the way people understand and value things, and consequently the way they "measure" learning.
One big problem with LSI is that there is no common, clear source of information. It seems that everyone has a different opinion on what it is, how to use it, and so on. For example, at the moment I'm focused on the problem of building a semantic space, which involves 8 steps:
  1. Choosing documents: The corpus could be a group of papers, documents or books, and each document can then be split into paragraphs, fixed-size segments, or units defined by some linguistic criterion.
  2. Tokenizing: Parsing the document to find URLs, abbreviations, words, etc. (steps 2 to 5 are illustrated in the first sketch after this list).
  3. Removing stop words: Some words carry very little semantic information, so they can be removed.
  4. Stemming: "Run" and "running" should carry the same semantic information, so we can just use "run"; the same process is applied to every word.
  5. Term selection: When you process text you get thousands of terms; usually the terms that appear in just one document, or in every document, are removed (especially if you want to differentiate documents).
  6. Term weighting: There are more than 10 different ways of doing this, and not much evidence about which one to use. For now I'm using LogEntropy (see the log-entropy sketch after this list).
  7. Dimensionality reduction: This is the key part of LSI. It consists of decomposing a term-by-document matrix using singular value decomposition, reducing the number of singular values (usually to 100), and finally reconstructing an approximation of the original matrix that reflects the way in which humans associate terms (see the SVD sketch after this list).
  8. Normalization: Whatever dataset you have, you can normalize it per attribute, per instance, etc. All I know so far is that per-instance normalization is useful for calculating the cosine of the angle between documents in the space, but I haven't found evidence that supports it in any other way (see the last sketch after this list).
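
To keep my own ideas in order, here is a minimal Python sketch of steps 2 to 5 (tokenizing, stop words, stemming and term selection). The regular expression, the tiny stop word list and the document-frequency thresholds are just illustrative choices, and it assumes NLTK's Porter stemmer is available:

```python
import re
from collections import Counter

from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "it"}

def preprocess(documents):
    """documents: list of raw text strings -> list of token lists."""
    stemmer = PorterStemmer()
    tokenized = []
    for text in documents:
        # Step 2: tokenize (here simply runs of lowercase letters)
        tokens = re.findall(r"[a-z]+", text.lower())
        # Step 3: remove stop words
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # Step 4: stem every remaining token ("running" -> "run")
        tokens = [stemmer.stem(t) for t in tokens]
        tokenized.append(tokens)

    # Step 5: keep terms that appear in more than one document
    # but not in every document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(tokenized)
    vocabulary = {t for t, d in df.items() if 1 < d < n_docs}
    return [[t for t in tokens if t in vocabulary] for tokens in tokenized]
```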
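
For step 6, this is a sketch of the log-entropy formulation I have in mind, applied to a term-by-document count matrix (terms in rows, documents in columns): the local weight is log(1 + tf_ij) and the global weight is 1 + sum_j p_ij log(p_ij) / log(n), with p_ij = tf_ij / gf_i. There are other formulations, so take it as one possible reading:

```python
import numpy as np

def log_entropy(counts):
    """counts: term-by-document matrix of raw frequencies (terms in rows)."""
    A = np.asarray(counts, dtype=float)
    n_docs = A.shape[1]

    # Local weight: log(1 + tf_ij)
    local = np.log1p(A)

    # Global weight: 1 + sum_j p_ij * log(p_ij) / log(n), with p_ij = tf_ij / gf_i
    gf = A.sum(axis=1, keepdims=True)      # total frequency of each term
    p = np.divide(A, gf, out=np.zeros_like(A), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)

    return local * global_w[:, np.newaxis]
```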
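
For step 7, a sketch of the truncated SVD using NumPy; k = 100 only because that is the usual value I mentioned above, it's really a tuning choice:

```python
import numpy as np

def reduce_dimensions(W, k=100):
    """W: weighted term-by-document matrix."""
    # Decompose W = U * diag(s) * Vt
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    k = min(k, len(s))
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Rank-k approximation of the original matrix
    W_k = U_k @ np.diag(s_k) @ Vt_k
    # Coordinates of each document (column) in the reduced space
    doc_space = np.diag(s_k) @ Vt_k
    return W_k, doc_space
```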
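
And for step 8, a sketch of per-instance (per-document) normalization: once every document vector has unit length, the dot product between two documents is exactly the cosine of the angle between them.

```python
import numpy as np

def normalize_columns(doc_space):
    """Scale every document (column) to unit length."""
    norms = np.linalg.norm(doc_space, axis=0, keepdims=True)
    norms[norms == 0] = 1.0          # leave all-zero documents untouched
    return doc_space / norms

def cosine(doc_space, i, j):
    """Cosine of the angle between documents i and j."""
    normalized = normalize_columns(doc_space)
    return float(normalized[:, i] @ normalized[:, j])
```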
Well, there are so many ways to do each of the previous steps that I'm really trying to put everything in order, but it's pretty hard.
I'll see how it goes.

1 comment:

Dr.E.Garcia said...

Hi, Jorge

Great to see another latino in IR. Good luck with your PhD.

Regarding LSI, feel free to check our list of tutorials and fast tracks at http://www.miislita.com

Cheers

Dr. Edel Garcia