I am researching algorithms for calculating semantic relatedness. Since the purpose of these algorithms is to approximate human judgment as closely as possible, their results have to be compared to human ratings of the same word pairs. Several datasets exist for this purpose, and they are more or less widely used in the current literature.
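A common way to do this comparison in the literature is Spearman's rank correlation between the human scores and the algorithm scores. The following is a minimal sketch in plain Python, not a reference implementation: the function names and the toy scores are made up for illustration and are not taken from any of the datasets.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(human_scores, algorithm_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(human_scores), average_ranks(algorithm_scores))


# Invented toy scores for four word pairs (NOT real dataset values).
human = [3.9, 0.4, 2.8, 1.1]
algorithm = [0.91, 0.05, 0.40, 0.62]
rho = spearman(human, algorithm)
print(round(rho, 3))
```

Rank correlation only asks whether the algorithm orders the pairs the way humans do, so the two score scales do not need to match.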
I uploaded a tar file with some evaluation datasets in a pandas-readable format with headers, which can be downloaded from http://www.thomas-niebler.de/downloads/eval_datasets_pandas.tar.gz. The file contains the following datasets:
- Miller&Charles 
- Rubenstein&Goodenough 
- MEN dataset 
- MTurk dataset 
- WordSimilarity353 
For the WordSimilarity353 dataset, I also included the individual scores of each rater.
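Per-rater scores make it possible to compute the mean rating of each pair and to see how much the raters disagree. Here is a minimal sketch with pandas, assuming a comma-separated file with a header row. The column names (`word1`, `word2`, `rater1`, ...) and all values are invented placeholders, so check the actual headers and separator in the extracted files before reusing it.

```python
import io

import pandas as pd

# Invented stand-in for one extracted file (NOT real dataset content).
# Column names and the comma separator are assumptions.
toy_file = io.StringIO(
    "word1,word2,rater1,rater2,rater3\n"
    "alpha,beta,9.0,10.0,9.5\n"
    "gamma,delta,6.0,7.0,6.5\n"
    "epsilon,zeta,0.0,1.0,0.5\n"
)

ratings = pd.read_csv(toy_file)

# Select the per-rater columns by their (hypothetical) name prefix.
rater_columns = [c for c in ratings.columns if c.startswith("rater")]

# Mean rating per pair, plus the sample standard deviation across raters
# as a rough indicator of disagreement.
ratings["mean_score"] = ratings[rater_columns].mean(axis=1)
ratings["rater_std"] = ratings[rater_columns].std(axis=1)

print(ratings[["word1", "word2", "mean_score", "rater_std"]])
```

For a real file, replace `toy_file` with the path to the extracted dataset and adjust `sep` in `pd.read_csv` if the files turn out not to be comma-separated.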
I also published a BitBucket repository with a small toolbox written in Python. The repository also contains the datasets listed above.
Have fun using these data!