Sunday, March 11, 2007

The dynamic collection issue

In text mining you always have a collection of documents, usually called a corpus, against which most operations are performed. This corpus is usually used for training machine learning algorithms. Now we want to use text mining to give feedback to students in English composition. For this we will have a dynamic collection of documents: the students' essays. This collection will change in size as students submit their essays, and it should stabilize within a week or so.
A number of questions arise in this case:
  1. Normal courses have anywhere from 10 to 300+ students. Does the size of the corpus make a big difference when it comes to feedback?
  2. Between 1 and 10 assignments during a semester is normal. Should the corpus used for feedback include all assignments, or just the collection formed by one assignment?
  3. Somewhat similar to 2: should we build one big corpus for the feedback, or use smaller ones grouped by topic?
  4. Several statistical techniques are prohibitively expensive to compute within an online interactive environment due to the size of the term/document matrix. This implies that we have to do some feature selection before calculating principal components. A lot of questions arise from this issue:
    1. How many features are needed for student feedback?
    2. As the collection changes, what do we do with new features? Do we calculate everything again?
    3. Should features that come from the training corpus have a different importance than those that come from the students' essays? If so, how do we express that difference?
  5. How can we build a model that reflects all these issues so we can compare and answer all these questions?
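To make question 4 a bit more concrete, here is a minimal sketch of one possible approach, not a settled design. It assumes a naive tokenizer and uses document frequency as the feature selection criterion, which is only one of many options. The class name DynamicCorpus and its methods are hypothetical names made up for this illustration. The idea is that each new essay only updates the counts for its own terms, so the feature list can be re-selected cheaply before any heavier step such as principal components.

```python
from collections import Counter
import re


class DynamicCorpus:
    """Hypothetical helper: a corpus that grows as essays are submitted."""

    def __init__(self):
        self.docs = []             # one term-count Counter per essay
        self.doc_freq = Counter()  # term -> number of essays containing it

    def add(self, text):
        # Naive tokenizer (an assumption): lowercase alphabetic runs only.
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        self.docs.append(counts)
        # Only the document frequencies of this essay's terms change,
        # so nothing already stored has to be recomputed.
        self.doc_freq.update(counts.keys())

    def select_features(self, k, min_df=2):
        # Keep the k terms found in the most essays, ignoring rare ones.
        # Ties are broken alphabetically so the result is deterministic.
        eligible = [(t, df) for t, df in self.doc_freq.items() if df >= min_df]
        eligible.sort(key=lambda item: (-item[1], item[0]))
        return [t for t, _ in eligible[:k]]

    def matrix(self, features):
        # Reduced document/term matrix: one row per essay, one column per feature.
        return [[doc[t] for t in features] for doc in self.docs]


corpus = DynamicCorpus()
corpus.add("The cat sat on the mat")
corpus.add("The dog sat on the log")
corpus.add("A cat and a dog")

features = corpus.select_features(k=3)
reduced = corpus.matrix(features)
```

The reduced matrix has one column per selected feature, so its width is bounded by k no matter how large the vocabulary grows. Whether the selected features stay stable as essays keep arriving is exactly what question 4.2 asks, and this sketch does not answer it.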
Well, these are the questions. For now I'll start coding in the simplest way so we can see what the system shows.
A little problem with Sakai is that in version 2.3.1 the full content of resources is not getting indexed, and I don't know why. For now I will try to write a document about what is getting indexed and what is not.