Extracting Semantic Information from Navigational Paths on Wikipedia

Update: I am currently in the process of organizing all of my code snippets and uploading them on to GitHub! The semantics-pathtools project covers many of the mentioned methods to extract semantic relatedness from human navigational paths on Wikipedia!

Wikipedia is one of the most visited pages in the web, and that’s not something unusual. Like no other webpage, Wikipedia embodies the spirit of the Web 2.0: Users can freely and easily contribute their knowledge or create links to other interesting articles, and collaboratively, the community created the world’s biggest encyclopedia with roughly 5 million articles in the English version alone!

Research about Wikipedia

Naturally, Wikipedia has been and still is in the focus of many researchers. Because most articles are concise and well-written, Wikipedia is a valuable resource for many NLP tasks. One prominent example is given by the ESA or Explicit Semantic Analysis method, which basically parses the whole of Wikipedia and creates a TF-IDF term-document matrix, from which the semantic relatedness, i.e. the inherent degree of relatedness of the underlying concepts, between two words can be calculated. Compared with the WordSimilarity353 dataset, the achieved correlation of the ESA [7] similarity scores is around 0.76, which actually is pretty good (the maximum value for a correlation measure is 1). Lately, this value has been exceeded by Word2Vec by Mikolov et al., which uses a simple neural network to learn word embeddings from raw text [4].

While the content of the Wikipedia articles provides a huge knowledge resource, there has been some work concerning itself with the link network of Wikipedia, another big asset of the encyclopedia. Because links should mostly only be set if two articles are somehow related, this link network represents a slim knowledge base with equally resourceful semantic content.

Navigating Wikipedia

Only recently, researchers have focused on the actual usage of this link network. While the static link network contains user intuition, because they have been collaboratively created, the actual navigation on them is an even more direct way to capture effects of human intent and intuition.

Navigation games on Wikipedia

The earliest works started out with a data from navigation game called Wikispeedia. The authors of this game use a reduced corpus of about 4k articles as well as a reduced link network (see https://snap.stanford.edu/data/wikispeedia.html). Players are given the task to navigate from one source article to a specified target article using only the provided links in the article text. Using these navigation traces, the authors designed a semantic relatedness measure between the article titles. Since every article represents a single concept, this measure thus gives a rough notion about the relatedness of the underlying concepts.

Although no research project itself, the WikiGame provides a similar concept as WikiSpeedia, but renders pages directly from Wikipedia, removing all tables, images and category links in the process, also not allowing disambiguation pages or use of the search function. Users are still given a pair of Wikipedia articles, where they have to reach the latter using the link network, while starting with the former. Users can choose the game type, though: The standard game type is a Speed Race, where the fastest user wins. Other game types are “Least clicks to win”, “no USA” and “five clicks to Jesus”.

Unconstrained navigation on Wikipedia and its differences to game navigation.

However, navigation in these games differs from “real”, unconstrained navigation in several ways, as we argued in [1]. First and foremost, the navigation intent is different. Where game users want to reach a goal as fast as possible, they make use of their “inner map” of concepts and their intuition of relation between concepts to select the most promising link, hopefully getting nearer to their target page. On the other hand, users browsing the original Wikipedia either discover new things by clicking on supposedly interesting links or they use the advanced functions of Wikipedia, such as the search function inside of Wikipedia or even Google.

Another difference of game and unconstrained navigation is made obvious when investigating the 1000 most visited pages. We categorized each of them into one of four categories according to their Wikipedia Categories: Person, Movie or Other. In the WikiGame data, most visited pages have been “Other” pages, i.e., many conceptual pages, such as “Table” or “State”. The Clickstream data, however, are comprised mainly of actual Person pages, such as “Jennifer Lawrence” or “Barack Obama”.

Making the Semantic Component in Navigation Visible

As I already mentioned, the earliest work on the semantic content of navigation on the Wikipedia network was done on a small subset of the Wikipedia graph. West et al. [6] presented the GWOP WikiSpeedia, with which they collected navigation data. In their work, they presented a similarity measure for concepts based on these navigation paths. It is defined as

    \[d(a,g) = \frac{1}{m} \overset{m}{\underset{k=1}{\Sigma}} d_{p_k}(a, g)\]

for paths p_1, …, p_m and the path-dependent distance measure d_{p_k}(a, g), which is defined for the article a visited while navigating to article g. Note that this measure does not yield a value, if a and g never occurred on the same path. Also, this approach is not applicable when we do not know the navigation goal.

In [2], we proposed a co-occurrence based approach to represent an article a_i through its navigation context, i.e., which articles a_j were visited before and after. Which articles are considered as context was defined by a given window size k:

    \[ coocc(a_i, a_j) = |\left\{ p \in \mathbb{P} : a_i, a_j \in p \text{ within click distance of at most }k\text{ clicks} \right\}| \]

This results in a (symmetric) co-occurrence matrix

    \[\mathcal{C} := ( coocc(a_i, a_j) )_{ij},\]

where each column represents the context of an article a_i.
We could also show that with only a certain portion of navigational paths with specific characteristics, the resulting word vectors contained semantic information higher than vectors from the state-of-the-art at the time, the ESA method.

As noted above, navigation in the WikiGame is fundamentally different than unconstrained, real navigation on Wikipedia. In [3], we focused on analyzing unconstrained navigation on Wikipedia obtained from the Wikimedia Research Group and experimented with adopting the method proposed in [5] to that unconstrained navigation. While the simple application of the co-occurrence method on unconstrained navigation yielded acceptable results, we proposed to only count the binary co-occurrence of two pages instead of the numeric co-occurrence. This way, we could increase the semantic content of the vector representations notably.

While we could point out differences in navigation behaviour in GWOPs on Wikipedia and unconstrained navigation, we still lack navigation data from specific users. The Clickstream datasets only contain transition data between two Wikipedia pages, collected across all users. In order to simulate potential user navigation, we adopted a random walk approach [1]. For this, we experimented with different settings about the path length. A Github Gist on how to create random walks with Apache Spark can be found at https://gist.github.com/thomasniebler/03c85200aecb55c256ce152352fa46f9.

Furthermore, we explored the application of Word2Vec on navigational data. We found that it yields good results to generate Word2Vec models from this navigation. They also outperformed the previous co-occurrence counting approaches on most random walk subsets, except on those random walks that we parameterized with how humans would walk. Also, this approach seemed successful enough that Wulczyn created word embeddings from navigation on Wikipedia and made them publicly available.

Conclusion

There have been many methods proposed on how to extract semantic relatedness information from Wikipedia navigation. These range from strictly path-based models to generating page embeddings. Generally, it can be seen that Wikipedia navigation contains a fair amount of semantic information and thus provide a viable source for further applications, such as recommending the next page for a user to click.

  • Niebler, T., Schlör, D., Becker, M., Hotho, A.: Extracting Semantics from Unconstrained Navigation on Wikipedia. KI -- Künstliche Intelligenz. 30, 163--168 (2016).
    BibTeX EndNote Download URL
  • Wulczyn, E.: Wikipedia Navigation Vectors, /brokenurl#10.6084/m9.figshare.3146878.v2, (2016).
    BibTeX EndNote URL
  • Singer, P., Niebler, T., Strohmaier, M., Hotho, A.: Computing Semantic Relatedness from Human Navigational Paths: A Case Study on Wikipedia. IJSWIS. 9, 41--70 (2013).
    BibTeX EndNote Download URL
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. NIPS. p. 3111--3119. Curran Associates, Inc. (2013).
    BibTeX EndNote URL
  • West, R., Pineau, J., Precup, D.: Wikispeedia: an online game for inferring semantic distances between concepts. IJCAI. p. 1598--1603. Morgan Kaufmann Publishers Inc., Pasadena, California, USA (2009).
    BibTeX EndNote URL
  • Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. IJCAI. p. 1606--1611. Morgan Kaufmann Publishers Inc., Hyderabad, India (2007).
    BibTeX EndNote URL