Minipost: Exploiting Spark as a ThreadPool

Apache Spark is a great tool for massively distributed data analysis and manipulation. It also features a machine learning library, but in the Python variant, it is unfortunately nowhere near as good as the awesome scikit-learn library, especially in combination with NumPy and SciPy, while the linear algebra tools in PySpark make it seem that ...

Project: Strava Club Challenges

Strava is a social network for sportsy people (similar to Fitocracy), but focused mainly on endurance sports such as running, cycling, or swimming. It offers plenty of options to track and analyze your training and your progress (and even more in the Premium version) and lets you actually be social in so-called clubs ...

Writing your thesis: A meta-list of advice!

Rather than repeating stuff that others repeatedly repeated, I’ll just list some useful and interesting links about how to write your thesis (or just another paper) with LaTeX. This list will probably be extended steadily. VERY basic approach to writing a BASIC thesis. Also still uses only BibTeX. http://texblog.org/2012/06/08/writing-a-thesis-in-latex/ Good overview of the whole process ...

BibSonomy + BibLaTeX: A powerful combination

As a PhD student, I often have to cite someone else’s work in my publications. Up until now, my colleagues and I have used the standard combination of pdflatex and bibtex. It works pretty well, but doesn’t have too many features. To store my bibliography, I use BibSonomy. I can browse my collection using tags and ...

Reading CSV files in Apache Spark

We recently acquired a small cluster of machines to run Hadoop on. We’re running the Cloudera distribution with an installed instance of Apache Spark. Since I have taken quite a liking to Python, I’m using PySpark heavily. I’ve worked quite a bit with Pandas, but Pandas does not easily support multiprocessing (I’ve tried some easy ...

Dataset Collection for Evaluating Semantic Relatedness

I am researching algorithms to calculate semantic relatedness. Since the purpose of these algorithms is to approximate human judgment as closely as possible, it is necessary to compare their results to human intuition about the same word pairs. For that, there exist several datasets which are more or less widely used in current literature. I ...

Silvesterlauf Estenfeld

After a hot shower, in warm, cozy clothes and with some delicious food, I am now writing this short report on today’s New Year’s Eve run (Silvesterlauf) in Estenfeld. The run has the character of a fun run, without timekeeping or rankings, let alone an award ceremony. The entry fee is not fixed, and it is donated directly. But that is exactly what makes the event ...

New pages on the blog

Hello to whoever reads this blog! I just added two new pages. The first page contains all my publications as hosted on my page on BibSonomy. If you want to know anything about any of these, just drop me a message. The other page shows all my planned events where I go running somewhere around ...

LDAP user management with Gosa

So, after a long time, finally another worthwhile blog post (at least for me 🙂). At our chair, we want central user management for the Windows and Linux boxes, with shared home directories. On top of that, the user management should get a simple web frontend. In this case, that turned out to be Gosa (https://oss.gonicus.de/labs/gosa). After a lot of fiddling and cursing due to many ...

Results from our human intuition experiment at the WWW’13 conference, Rio de Janeiro

Hello all, at the WWW’13 conference in Rio de Janeiro, we presented our poster on “Calculating Semantic Relatedness from Human Navigational Paths on Wikipedia” (see my BibSonomy profile). We also collected some user intuitions about the semantic relatedness of “bread” and “butter”, two exemplary words from the WordSimilarity353 dataset. The WordSimilarity353 dataset contains 353 pairs ...