Wikipedia, a platform that needs no introduction, is a staple of the internet. Anyone with even a passing interest in a topic can instantly search for it and read an in-depth, well-sourced article. Often, people find themselves in a so-called “Wiki rabbit hole,” reading a page and continuously clicking through links to more related information. However, a user can only follow links that appear directly on the page. What if a Wikipedia article could surface a variety of related pages that it does not directly link to?
The goal of this project was to give the user a list of related Wikipedia pages that are not necessarily connected through direct links. This would let the user click through related subjects even when the article never mentions them directly. It is implemented by using NLP to determine the most similar articles to every article in the entire English Wikipedia.
The entire English Wikipedia is available for download from the Wikipedia website. Wiki dumps are made available periodically, and we start from the February 1st dump. We then use Beautiful Soup and Keras’s get_file function to find and download all of the available compressed, text-only partition files, which come to around 16 GB of data. We could instead download all articles as a single file, but downloading roughly 60 partitioned files lets us parallelize the cleaning process and greatly reduce run-time.
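The file-selection step can be sketched as follows. In practice Beautiful Soup scrapes the hrefs from the dump index page; the filenames and the regular expression below are assumptions based on the dump's naming scheme, shown here for illustration.

```python
import re

def partition_files(hrefs):
    """Keep only the partitioned, compressed, text-only article files."""
    # Partitions are named ...pages-articles<N>.xml-p<start>p<end>.bz2
    pattern = re.compile(r"pages-articles\d+\.xml-p\d+p\d+\.bz2$")
    return [h for h in hrefs if pattern.search(h)]

links = [
    "enwiki-20210201-pages-articles.xml.bz2",             # single full dump
    "enwiki-20210201-pages-articles1.xml-p1p41242.bz2",   # partition 1
    "enwiki-20210201-pages-articles2.xml-p41243p151573.bz2",
]
print(partition_files(links))  # only the two partitioned files survive
```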
A single partitioned file contains a large number of compressed Wikipedia articles in XML format. We clean each file by using bzcat to decompress it line by line, an XML parser to flag when we reach the end of each article, and a cleaning function within our parser to clean, stem, and tokenize each article as it is parsed.
for line in subprocess.Popen(['bzcat'],
                             stdin=open(data_path),
                             stdout=subprocess.PIPE).stdout:
    parser.feed(line)  # the XML parser flags article ends and cleans text
For the engine to be computationally efficient with this amount of data, we also limit the text we look at to only the introduction of each Wikipedia article, which should provide a good approximation of an article’s content. Cleaning all partitions with Python’s multiprocessing library takes around 3 hours on 6 cores, versus an estimated 20 hours without parallelization. After cleaning, we combine all tokenized partitions into a single file, and we can start analyzing the data!
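The parallel cleaning step can be sketched with multiprocessing.Pool; here clean_partition is a hypothetical stand-in for the real decompress-parse-tokenize function.

```python
from multiprocessing import Pool

def clean_partition(path):
    """Stand-in for the real cleaning step: decompress, parse, tokenize."""
    return path.replace('.bz2', '.tokens')   # pretend we wrote a token file

if __name__ == '__main__':
    partitions = [f'partition-{i}.xml.bz2' for i in range(60)]
    with Pool(processes=6) as pool:          # one worker per core
        cleaned = pool.map(clean_partition, partitions)
    print(len(cleaned))                      # 60 cleaned partitions
```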
Now that we have this data in a usable format, our goal is to convert this data to a TF-IDF matrix and find the most similar articles by looking for the shortest cosine distance between the TF-IDF word frequency of all articles. The simplest way to do this is to first convert our dataset to Gensim’s Bag of Words and Dictionary format, and use Gensim’s built-in functions to examine this data. Essentially, the Gensim Dictionary maps every unique word to a unique ID, and the Bag of Words converts every tokenized article to a list of IDs.
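Conceptually, the Dictionary and Bag of Words steps do the following. This is a plain-Python sketch of the idea, not Gensim's actual API, and the two tiny "articles" are stand-ins for tokenized introductions.

```python
from collections import Counter

docs = [["wiki", "article", "search"], ["wiki", "rabbit", "hole"]]

# Dictionary: map every unique word to a unique ID
word2id = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}

# Bag of Words: each tokenized article becomes sparse (ID, count) pairs
bows = [sorted((word2id[w], n) for w, n in Counter(d).items()) for d in docs]
print(bows[0])  # [(0, 1), (3, 1), (4, 1)] -> article, search, wiki
```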
We then use Gensim’s TfidfModel implementation to convert the Bag of Words into word-frequency weights, using natural term frequency, inverse document frequency, and cosine article-length normalization to account for articles with shorter introductions. To avoid holding the full Bag of Words in RAM, we stream it from disk using Gensim’s MmCorpus.
corpus = corpora.mmcorpus.MmCorpus(bowpath)
dictionary = corpora.dictionary.Dictionary.load(dictpath)
tfidf = models.TfidfModel(tqdm(corpus), smartirs='ntc')
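As a reminder of what smartirs='ntc' means (natural term frequency, log inverse document frequency, cosine normalization), here is a hand-rolled sketch of the weighting, not Gensim's internal code:

```python
import math

def tfidf_ntc(bow, df, n_docs):
    """bow: (term_id, count) pairs; df: document frequency per term."""
    # n: natural term frequency, t: idf = log2(N / df)
    weights = {t: tf * math.log2(n_docs / df[t]) for t, tf in bow}
    # c: cosine (L2) normalization over the document vector
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

print(tfidf_ntc([(0, 2), (1, 1)], df={0: 1, 1: 2}, n_docs=2))
# term 1 appears in every document, so its weight drops to zero
```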
From this TF-IDF matrix, we need the cosine similarity between each pair of articles. For this, we initialize a Gensim Similarity index that takes a query, computes the cosine distance between the query and all articles, and sorts the results to find the N closest articles, excluding the query article itself.
sim = similarities.docsim.Similarity(shard_directory,
                                     tfidf[corpus],
                                     num_features=corpus.num_terms)
With this Similarity index in hand, I deploy a Streamlit app that takes an article’s title, looks up the title’s ID in a lookup table, and returns to the user a list of the N closest articles.
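The lookup behind the app can be sketched like this; the scores here are hypothetical stand-ins, while the real app gets its similarity row from the Similarity index.

```python
import numpy as np

def top_n(scores, query_id, n=5):
    """IDs of the n highest-similarity articles, excluding the query itself."""
    order = np.argsort(scores)[::-1]                     # most similar first
    return [int(i) for i in order if i != query_id][:n]

scores = np.array([1.0, 0.2, 0.9, 0.4])  # similarity of query (ID 0) to all
print(top_n(scores, query_id=0, n=2))    # -> [2, 3]
```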
With the Streamlit app built, I plan to deploy it to Heroku so users can easily access recommendations for a Wikipedia article. Prior to deploying, I plan to reduce the time it takes to retrieve recommendations by precomputing and storing them, rather than running a cosine-similarity lookup through the Similarity index at query time.
After the app is available, I also plan to explore working with recommendation systems other than cosine similarity such as LDA and Word2Vec.
This project was originally completed as a project for the Metis Bootcamp. You can find all code for this project on my Github, and please feel free to reach out to me with any questions on my LinkedIn.