Extracting Semantic Information from Navigational Paths on Wikipedia

Update: I am currently in the process of organizing all of my code snippets and uploading them to GitHub! The semantics-pathtools project covers many of the methods mentioned here to extract semantic relatedness from human navigational paths on Wikipedia! Wikipedia is one of the most visited pages on the web, and that is hardly surprising. Like no other website, Wikipedia embodies the spirit of the Web 2.0: users can freely and easily contribute their knowledge or create links to other interesting articles, and collaboratively, the community has created the world’s biggest encyclopedia, with roughly 5 million articles in the English version …
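
As a rough illustration of what such a method can look like, here is a minimal sketch that builds co-occurrence vectors from navigational paths and compares two articles via cosine similarity. The window size, the toy paths and the function names are illustrative choices of mine, not necessarily what semantics-pathtools actually implements.

```python
from collections import defaultdict
from math import sqrt

def cooccurrence_vectors(paths, window=3):
    """Count how often two articles occur within `window` steps of each other in a path."""
    vectors = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for i, article in enumerate(path):
            for other in path[i + 1:i + 1 + window]:
                if other != article:
                    vectors[article][other] += 1
                    vectors[other][article] += 1
    return vectors

def cosine_relatedness(vectors, a, b):
    """Cosine similarity between the co-occurrence vectors of two articles."""
    va, vb = vectors[a], vectors[b]
    dot = sum(count * vb.get(key, 0) for key, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# toy example with two made-up navigational paths
paths = [["Bread", "Butter", "Milk"], ["Bread", "Baking", "Butter"]]
vectors = cooccurrence_vectors(paths)
print(cosine_relatedness(vectors, "Bread", "Butter"))
```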

Continue reading »

Minipost: Exploiting Spark as a ThreadPool

Apache Spark is a great tool for massively distributed data analysis and manipulation. It also features a machine learning library, but the Python variant is unfortunately nowhere near as good as the awesome scikit-learn library, especially in combination with numpy and scipy, and the linear algebra tools in PySpark make it seem that the coders didn’t even bother to try. Added to that, my models are for now small enough to train with the aforementioned Python libraries. In my current setting, I run a single training procedure with one single parameter 100 times.
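
The idea, as a minimal sketch: let Spark only distribute the 100 independent runs, while each executor trains a plain scikit-learn model locally. The Ridge model, the synthetic data and the parameter grid below are placeholders for illustration, not the actual training procedure from the post.

```python
from pyspark import SparkContext
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def train_and_score(alpha):
    """Run one independent training job locally with scikit-learn."""
    X, y = make_regression(n_samples=500, n_features=20, random_state=0)
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    return alpha, score

sc = SparkContext(appName="spark-as-threadpool")

# Spark only plays the role of a thread pool here: one partition per run,
# so all 100 jobs are spread over the cluster and executed in parallel.
params = [0.1 * i for i in range(1, 101)]
results = sc.parallelize(params, numSlices=len(params)).map(train_and_score).collect()
best_alpha, best_score = max(results, key=lambda r: r[1])
```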

Continue reading »

Reading CSV files in Apache Spark

We recently acquired a small cluster of machines to run Hadoop on. We’re running the Cloudera distribution with an installed instance of Apache Spark. Since I have taken quite a liking to Python, I’m heavily using PySpark. I’ve worked quite a bit with Pandas, but Pandas does not easily support multiprocessing (I’ve tried some easy workarounds, but they’re just that: workarounds). While I found that the PySpark DataFrames can do similar things to the Pandas DataFrames, I could not find an easy and robust way to read CSV files with huge loads of highly faulty user data (such as newlines …
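
One way to cope with records that span several physical lines (e.g. newlines inside quoted fields) is to read whole files with wholeTextFiles and let Python’s csv module do the parsing. This is only a sketch under the assumption that each individual file fits into an executor’s memory; the HDFS path is made up.

```python
import csv
from io import StringIO
from pyspark import SparkContext

sc = SparkContext(appName="robust-csv")

def parse_file(contents):
    """Parse one whole CSV file with Python's csv module, which handles
    quoted fields containing commas and embedded newlines."""
    for row in csv.reader(StringIO(contents)):
        if row:  # skip completely empty lines
            yield row

# wholeTextFiles yields (path, content) pairs, so a record spanning several
# physical lines is not torn apart the way it would be with sc.textFile.
rows = (sc.wholeTextFiles("hdfs:///data/messy/*.csv")
          .flatMap(lambda kv: parse_file(kv[1])))
print(rows.take(5))
```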

Continue reading »

Dataset Collection for Evaluating Semantic Relatedness

I am researching algorithms to calculate semantic relatedness. Since the purpose of these algorithms is to approximate human judgment as closely as possible, it is necessary to compare the algorithm results to human intuition on the same word pairs. For that, there exist several datasets which are more or less widely used in the current literature. I uploaded a tar file with some evaluation datasets in a pandas-readable format with headers, which can be downloaded from http://www.thomas-niebler.de/downloads/eval_datasets_pandas.tar.gz. The file contains the following datasets: Miller&Charles [1], Rubenstein&Goodenough [2], the MEN dataset [3], the MTurk dataset [4] and WordSimilarity353 [5]. For the WordSimilarity353 dataset, I also included the distinct scores …
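
Once the archive is extracted, loading a dataset and running the usual Spearman rank correlation against an algorithm’s scores is only a few lines of pandas. The file name and column names below are assumptions; check the headers of the extracted files.

```python
import pandas as pd
from scipy.stats import spearmanr

# hypothetical file/column names -- adjust to the headers in the extracted files
ws353 = pd.read_csv("wordsim353.csv")

def evaluate(df, relatedness, word1="word1", word2="word2", human="score"):
    """Spearman rank correlation between human judgments and an algorithm's scores."""
    predicted = [relatedness(a, b) for a, b in zip(df[word1], df[word2])]
    return spearmanr(df[human], predicted).correlation

# usage: evaluate(ws353, my_relatedness_function)
```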

Continue reading »

Results from our human intuition experiment at the WWW’13 conference, Rio de Janeiro

Hello all, at the WWW’13 conference in Rio de Janeiro, we presented our poster on “Calculating Semantic Relatedness from Human Navigational Paths on Wikipedia” (see my BibSonomy profile). We also collected some user intuitions about the semantic relatedness of “bread” and “butter”, two exemplary words from the WordSimilarity353 dataset. The WordSimilarity353 dataset contains 353 pairs of English words or names, combined with a human judgment of semantic similarity for each word pair. For every word pair, at least 10 different people rated the similarity by assigning a value between 0 and 10, where 0 represents no relation and 10 represents …
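
The aggregated score of a pair is the mean over these individual ratings. Here is a small pandas sketch, assuming a file with the distinct rater scores in which every column besides the two words holds one rater’s judgments (the file and column names are hypothetical):

```python
import pandas as pd

raw = pd.read_csv("wordsim353_raw.csv")  # hypothetical file with one column per rater
rater_cols = [c for c in raw.columns if c not in ("word1", "word2")]

# the aggregated relatedness judgment of a pair is the mean over all raters
raw["mean_score"] = raw[rater_cols].mean(axis=1)
print(raw[(raw["word1"] == "bread") & (raw["word2"] == "butter")])
```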

Continue reading »