Tuesday, May 22, 2007

Developing for Sakai 2.4.x

Ok, we have a new version of Sakai, 2.4.x, and I have to use it to integrate the Text Mining features we are implementing with an e-learning platform.

Installing and using Sakai is a nightmare. Now we have a new version and new documentation, so I'll try to follow the process and get my code compiling against Sakai. I'll post the results in this blog.

Tuesday, May 15, 2007

Latent Semantic Indexing and the order of things

Ok, Latent Semantic Indexing is a technique for information retrieval where you build a high-dimensional representation of a set of documents (or parts of documents); you can then measure distances between documents projected onto this space.
I'm working with this for two reasons:
  1. There is a project that is actually paying me a scholarship, and this project promises to use the latest techniques for text mining.
  2. Some evidence shows that "common sense" and "intuition" can actually be captured by this technique; these are precisely the kinds of things that I think best represent the way people understand and value things, and in consequence the way they "measure" learning.
One big problem with LSI is that you don't have a common and clear source of information. It's as if everyone has a different opinion on what it is, how to use it, etc. For example, at the moment I'm focused on the problem of building a semantic space, and for this there are 8 steps:
  1. Choosing documents: It could be a group of papers, documents or books; each document could then be split into paragraphs, fixed-size segments, or divided by some linguistic criterion.
  2. Tokenizing: Parsing the document to find URLs, abbreviations, words, etc.
  3. Removing stop words: Some words carry very little semantic information, so they can be removed.
  4. Stemming: "Run" and "running" should carry the same semantic information, so we can just use "run"; the same process is applied to every word.
  5. Term selection: When you process text you get thousands of terms; usually terms that appear in just one document, or in every document, are removed (especially if the goal is to differentiate documents).
  6. Term weighting: There are more than 10 different ways of doing this, and not much evidence about which one to use. For now I'm using LogEntropy (see the sketch after this list).
  7. Dimensionality reduction: This is the key part of LSI: decompose the term-by-document matrix using singular value decomposition, then reduce the number of singular values kept (usually to around 100), and finally reconstruct an approximation of the original matrix that reflects the way in which humans associate terms.
  8. Normalization: Whatever dataset you have, you can normalize per attribute, per instance, etc. All I know for now is that per-instance normalization is useful for calculating the cosine of the angle between documents in the space, but I haven't found evidence that supports it in any other way.
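To keep at least steps 6 and 8 straight in my own head, here is a minimal sketch in plain Java of the LogEntropy weighting and the cosine comparison (the SVD of step 7 I would hand off to a linear algebra library such as JAMA, so it is not shown). The class and method names are mine, made up for illustration, not taken from any particular toolkit:

    // Minimal sketch: log-entropy term weighting (step 6) and cosine similarity (step 8).
    // counts[i][j] = raw frequency of term i in document j.
    public class LsiSketch {

        // Log-entropy: local weight log(1 + tf_ij), global weight 1 + sum_j (p_ij * log p_ij) / log n,
        // where p_ij = tf_ij / gf_i, gf_i is the total frequency of term i and n is the number of documents.
        public static double[][] logEntropy(double[][] counts) {
            int terms = counts.length;
            int docs = counts[0].length;
            double[][] weighted = new double[terms][docs];
            for (int i = 0; i < terms; i++) {
                double globalFreq = 0.0;
                for (int j = 0; j < docs; j++) {
                    globalFreq += counts[i][j];
                }
                double entropy = 0.0;
                for (int j = 0; j < docs; j++) {
                    if (counts[i][j] > 0) {
                        double p = counts[i][j] / globalFreq;
                        entropy += p * Math.log(p);
                    }
                }
                double globalWeight = 1.0 + entropy / Math.log(docs);
                for (int j = 0; j < docs; j++) {
                    weighted[i][j] = Math.log(1.0 + counts[i][j]) * globalWeight;
                }
            }
            return weighted;
        }

        // Cosine of the angle between two document vectors (columns of the reduced matrix);
        // this is where the per-instance normalization of step 8 pays off.
        public static double cosine(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int k = 0; k < a.length; k++) {
                dot += a[k] * b[k];
                normA += a[k] * a[k];
                normB += b[k] * b[k];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }

The nice thing about the cosine is that normalizing each document vector to unit length (step 8, per instance) turns it into a plain dot product.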
Well, there are so many ways to do all the previous steps that I'm really trying to put everything in order, but it's pretty hard.
I'll see how it goes.

Sunday, March 11, 2007

The dynamic collection issue

In text mining you always have a collection of documents, usually called a corpus, against which most operations are performed. This corpus is usually used for training machine learning algorithms. Now we want to use text mining to give feedback to students on English composition, and for this we will have a dynamic collection of documents: the students' essays. This collection will change in size as students submit their essays, and it should stabilize within a week or so.
There's a number of questions that arise in this case:
  1. Normal courses have anything from 10 to 300+ students. Does the size of the corpus make a big difference when it comes to feedback?
  2. 1 to 10 assignments during a semester is normal; should the corpus include all assignments for feedback, or just the collection formed by a single assignment?
  3. Somewhat similar to 2: should we build one big corpus for the feedback, or use small ones grouped by topic?
  4. Several statistical techniques are prohibitively expensive within an online interactive environment due to the size of the term/document matrix. This implies that we have to do some feature selection before calculating principal components (see the sketch below). A lot of questions arise from this issue:
    1. How many features are needed for student feedback?
    2. As the collection changes, what do we do with new features? Do we recalculate everything?
    3. Should features that come from training have a different importance than those that come from the students' essays, and how do we express that difference?
  5. How can we build a model that reflects all these issues so we can compare and answer all these questions?
Well, these are the questions; for now I'll start coding in the simplest way so we can see what the system shows.
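For question 4, the simplest feature selection I can think of is the same idea as the term selection step of the semantic space recipe: drop terms by document frequency before any decomposition. A minimal sketch in Java, with made-up names and placeholder thresholds (minDf and maxDf are not tuned values):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Sketch of document-frequency-based term selection: keep only terms that appear
    // in more than one document but not in (almost) every document.
    public class TermSelection {

        // docFreqs maps each term to the number of documents it appears in.
        public static List<String> selectTerms(Map<String, Integer> docFreqs, int totalDocs) {
            int minDf = 2;                        // drop terms seen in only one document
            int maxDf = (int) (0.9 * totalDocs);  // drop terms seen in nearly every document
            List<String> selected = new ArrayList<String>();
            for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
                int df = entry.getValue();
                if (df >= minDf && df <= maxDf) {
                    selected.add(entry.getKey());
                }
            }
            return selected;
        }
    }

This says nothing about questions 4.2 and 4.3, of course; that is exactly what I still have to figure out.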
A little problem with Sakai is that in version 2.3.1 resources are not getting their whole content indexed, and I don't know why. For now I will try to put together a document about what is getting indexed and what is not.

Wednesday, February 14, 2007

The Sakai installation nightmare

Ok, a lot of water has passed under the bridge: a few weeks of work, and creating Tag Clouds for Sakai using Lucene indexes.
The installation nightmare was terrible and it seems that it has finally ended, for a while... The problem, I think, is the incredible pressure that frameworks impose on developers. As there are so many things to take care of, obviously, being the human beings that we are, we just forget.
Two important bugs were behind our problems:
  • When upgrading to Hibernate 3, the people from Sakai forgot to change the declaration of some DTDs; the bug is here. They found the problem and created a patch. I tried the production branch and it didn't work... it was hard to realize that when they committed the patch, they forgot one file... ok, I posted the problem and now it's fixed.
  • But when we tried to install from the 2.3.x branch again, the errors came up again. Why? Well, Dimitri found that a setting in the build.properties file we were using was the problem: we had set maven.build.dir to a different place than target, just to hide it in Eclipse, so it was changed to ".target" (see the snippet below). The problem was that a tool named samigo had a "hard-coded" path using the default target build dir. Why does this happen? Because there are too many pieces to move and people just forget. Well, Dimitri reported the bug, another little improvement. Now we have to make Sakai work properly with our code.
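For the record, the offending setting in our build.properties was more or less this (reconstructed from memory; maven.build.dir is the only property actually involved):

    # build.properties: we moved the build output away from "target" to hide it in Eclipse
    maven.build.dir = .target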

Tuesday, January 30, 2007

Installing Sakai

Ok, the facts:

  1. Sakai is VERY HARD to install: too many versions, poor documentation, and nobody doing anything to ease the process.
  2. Sakai has had a lot of engineering time spent on its architecture, so it has features and is able to scale like no other open source LMS.
  3. Apache's Lucene is a great solution for what I want to do: it parses documents and even creates the TF-IDF vectors, so I want it (see the sketch after this list).
  4. We have to convince some teachers to let us use the data from the essays that their students have uploaded to the course site. So we need an LMS.
  5. Lucene is already integrated with Sakai, i.e. I have to use it....
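To convince myself that point 3 is real, here is a minimal sketch of what I have in mind, assuming a Lucene 2.x-era API: index a couple of documents with term vectors enabled, then read the term and document frequencies back and compute TF-IDF from them. The field name and sample texts are made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.store.RAMDirectory;

    // Sketch: index documents with term vectors, then read term and document frequencies
    // back out to build TF-IDF vectors.
    public class LuceneTfIdfSketch {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            String[] essays = { "first essay text ...", "second essay text ..." };
            for (String essay : essays) {
                Document doc = new Document();
                doc.add(new Field("content", essay, Field.Store.NO,
                                  Field.Index.TOKENIZED, Field.TermVector.YES));
                writer.addDocument(doc);
            }
            writer.optimize();
            writer.close();

            IndexReader reader = IndexReader.open(dir);
            int numDocs = reader.numDocs();
            TermFreqVector vector = reader.getTermFreqVector(0, "content"); // terms of the first essay
            String[] terms = vector.getTerms();
            int[] freqs = vector.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                int df = reader.docFreq(new Term("content", terms[i]));
                double tfidf = freqs[i] * Math.log((double) numDocs / df);
                System.out.println(terms[i] + "\t" + tfidf);
            }
            reader.close();
        }
    }

The real question is whether I can read vectors like these straight from the indexes Sakai already maintains, instead of building my own.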
I promise that if I have the time during my research to do something for all the people who have to live through the nightmare of installing Sakai and making it work... I'll do it.

Starting my Ph.D.

Ok, I'm starting these posts as a way of keeping a log of the thoughts and things that are happening to me during my attempt to get a Ph.D. from the Uni of Sydney.
I already have my first two tasks:
  1. Use the copy machine to get 10 copies of a chapter about "How to write a decent thesis" and another 10 copies of a form called "What do you think your advisor is useful for"
  2. Build a prototype of a Text Mining Service to integrate with an LMS by July, so we can test it in the second semester and then try to convince some teachers to use it and get a whole bunch of data to work with.
To be honest, by the time I'm writing this, I have already finished the first task. It wasn't so hard, but believe me, the copy machine can sometimes be a little more complicated than an Artificial Intelligence algorithm.