I'm working with this for two reasons:
- There is a project that is actually paying me a scholarship, and it promises to use the latest text-mining techniques.
- Some evidence suggests that this technique can actually capture "common sense" and "intuition", precisely the kind of things that I think better represent the way people understand and value things, and, in consequence, the way they "measure" learning.

The basic LSI pipeline, as I understand it, goes like this:
- Choosing documents: it could be a set of papers, documents, or books; each document can then be split into paragraphs, fixed-size segments, or divided by some linguistic criterion.
- Tokenizing: parsing the document to identify words, URLs, abbreviations, etc.
- Removing stop words: some words carry little semantic information on their own (like "the" or "of"), so they can be removed.
- Stemming: "run" and "running" carry the same semantic information, so both can be reduced to "run"; the same process is applied to every word (there's a small sketch of these three preprocessing steps after this list).
- Term selection: when you process text you get thousands of terms; usually, terms that appear in just one document, or in every document, are removed, especially if the goal is to differentiate documents (sketched after this list).
- Term weighting: there are more than 10 different ways of doing this, and not much evidence about which one to use. For now I'm using LogEntropy (sketched below).
- Dimensionality reduction: this is the key part of LSI. It consists of decomposing the term-by-document matrix using singular value decomposition, truncating the number of singular values (usually to around 100), and reconstructing a low-rank approximation of the original matrix that reflects the way humans associate terms (see the SVD sketch below).
- Normalization: whatever dataset you have, you can normalize it per attribute, per instance, etc. All I know so far is that per-instance normalization is useful for computing the cosine of the angle between documents in the space, but I haven't found evidence that supports it in any other way (the last sketch below shows this).
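Since the preprocessing steps are easier to see in code, here's a minimal sketch. It assumes NLTK is installed (with the "stopwords" data downloaded); the URL-aware tokenizer is just a regex of my own, not anything standard.

```python
import re

from nltk.corpus import stopwords      # needs nltk.download('stopwords') once
from nltk.stem import PorterStemmer

STOP = set(stopwords.words('english'))
stemmer = PorterStemmer()
URL_RE = re.compile(r'https?://\S+')

def preprocess(text):
    """Tokenize, remove stop words, and stem a raw document."""
    # Tokenizing: pull URLs out first so they survive as single tokens,
    # then grab plain words with a simple regex.
    urls = URL_RE.findall(text)
    words = re.findall(r'[a-z]+', URL_RE.sub(' ', text.lower()))
    # Removing stop words: drop words with little semantic content.
    words = [w for w in words if w not in STOP]
    # Stemming: 'running' and 'runs' both map to the stem 'run'.
    return urls + [stemmer.stem(w) for w in words]

print(preprocess('Running daily runs at http://www.miislita.com'))
# ['http://www.miislita.com', 'run', 'daili', 'run']
```

(The 'daili' in the output is the Porter stemmer doing its usual thing: stems are not always real words.)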
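For term selection, the document-frequency filter I described might look like this. A sketch: `docs` is assumed to be a list of token lists, e.g. the output of `preprocess` above.

```python
from collections import Counter

def select_terms(docs):
    """Keep terms that appear in more than one document but not in all."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))             # count documents, not occurrences
    return sorted(t for t, f in df.items() if 1 < f < n)

docs = [['run', 'fast'], ['run', 'slow'], ['walk', 'slow']]
print(select_terms(docs))               # ['run', 'slow']
```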
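And here is the LogEntropy weighting, at least in the form I'm using it; conventions differ between papers, so take the exact formula as my assumption: the local weight is log(1 + tf), and the global weight is 1 + sum_j(p_ij * log(p_ij)) / log(n), with p_ij = tf_ij / gf_i.

```python
import numpy as np

def log_entropy(tf):
    """tf: term-by-document matrix of raw counts, shape (terms, docs)."""
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)       # global frequency of each term
    p = np.where(tf > 0, tf / gf, 1.0)       # p=1 in empty cells so p*log(p)=0
    entropy = (p * np.log(p)).sum(axis=1, keepdims=True)
    g = 1.0 + entropy / np.log(n_docs)       # global weight, between 0 and 1
    return np.log(1.0 + tf) * g              # local weight * global weight

tf = np.array([[2, 0, 1],    # concentrated term -> keeps most of its weight
               [1, 1, 1]])   # evenly spread term -> global weight is 0
print(log_entropy(tf))
```

Note how a term that appears evenly in every document gets a global weight of exactly 0, which is the whole point: such a term tells you nothing about any particular document.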
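The dimensionality-reduction step is short enough to sketch with numpy's SVD (the k=100 default is just the usual rule of thumb mentioned above):

```python
import numpy as np

def lsi_reduce(A, k=100):
    """Rank-k approximation of a term-by-document matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))          # can't keep more singular values than exist
    # Keeping only the k largest singular values gives the closest rank-k
    # matrix to A; the reconstruction is where term associations show up.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```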
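Finally, per-instance normalization: scaling each document (each column) to unit length, so that the plain dot product between any two columns is exactly the cosine of the angle between them. That's the only use I've found for it so far.

```python
import numpy as np

def normalize_per_instance(A):
    """Scale each column (document) of A to unit Euclidean length."""
    norms = np.linalg.norm(A, axis=0, keepdims=True)
    norms[norms == 0] = 1.0     # leave all-zero documents as they are
    return A / norms

A = np.array([[1.0, 2.0],
              [1.0, 0.0]])
An = normalize_per_instance(A)
print(An.T @ An)   # off-diagonal entry ~0.707 is the cosine between the docs
```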
I'll see how it goes.
1 comment:
Hi, Jorge
Great to see another Latino in IR. Good luck with your PhD.
Regarding LSI, feel free to check our list of tutorials and fast tracks at http://www.miislita.com
Cheers
Dr. Edel Garcia