Here are some links to blogs, sites, and books that I have found useful and that you may find useful too.

About the Project

I started working on text mining and information retrieval sometime around 2007, and I am still learning. This project is a work in progress and contains some components and building blocks I built while learning this stuff. As I learn more, more code will be added. Use it if you find it useful.

I plan on using this site as a placeholder for this code. There is no formal documentation. However, I usually talk about the code on my blog. Links to posts that deal with particular aspects of the code in the project are listed below. I am also usually pretty liberal with inline comments (Javadoc and otherwise) in my code, so you may want to download the source code and generate the Javadocs locally if that makes sense, or read through the code.

If you have questions, please post them as comments on the relevant blog post; that way you have a better chance of getting an answer, either from me or from other readers. If you find bugs, it would be awesome if you could send me a patch through the tracker; otherwise, just point them out on the relevant blog post.

Links to specific posts

Here are links to some of my blog posts covering parts of the code in the project as I built them.

  • Vector Space Classifier using Lucene
  • Binary Naive Bayes Classifier using Lucene
  • Summarization with Lucene
  • IR Math in Java : Citation based Ranking
  • IR Math in Java : Rule based POS Tagger
  • IR Math in Java : HMM Based POS Tagger/Recognizer
  • Phrase Spelling Corrector using Word Collocation Probabilities
  • IR Math in Java : Experiments in Clustering
  • IR Math in Java : Cluster Visualization
  • IR Math with Java : Similarity Measures
  • IR Math with Java : TF, IDF and LSI
  • Ontology Persistence with Prevayler
  • Modeling an Ontology in memory with JGraphT
  • Parsing OWL XML with StAX
  • Tokenizing Text : Token Recognition
  • Tokenizing Text with ICU4j's RuleBasedBreakIterator

Licensing

The code in this project is released under the GNU Lesser General Public License (LGPL), which pretty much allows you to incorporate code from this project into your own (commercial or otherwise) project without fear of liability and without an expectation that you open source your own project.

About the Site

I got the template for the site from Open Source Web Design, where it was contributed by Craig from DesignCreek (thanks, Craig). I chose the template because the pencil caps reminded me of overlapping bell curves, because pencil and paper are tools that are typically forgotten when we talk about all this fancy computer stuff, and because they symbolize collaborative learning, which is pretty much what I hope to do with this site.