Why Learn Japanese: The Role of Japanese Media

(Research done by our member, Mahdi Austian) Our goal is to survey how prevalent anime culture is as a motivation for Japanese learners. We mined 50 Reddit threads from the r/LearnJapanese subreddit about the reasons people study Japanese. We then conducted an exploratory analysis to quantify the presence of anime-related keywords in the top 10 comments of each thread. Such words were mentioned in 40% of the comments, indicating that understanding Japanese media and entertainment is a key motivation for students learning Japanese.

Methods & Data

Data Collection

Using the googlesearch-python package, we scraped Google search results for the query “reason learn japanese site:reddit.com”. We manually filtered out unrelated threads, narrowing the set from 228 to 50. For each remaining thread, we used praw (a Reddit API wrapper) to collect the top 10 original comments. If a thread had fewer than 10 comments, we took all the comments that were available.
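A minimal sketch of the comment-collection step might look like the following. It assumes `submission` is a praw `Submission` object (obtained, for example, from an authenticated `praw.Reddit` client via `reddit.submission(url=...)`), that "top 10" refers to Reddit's "top" sort order, and that "original comments" means top-level comments. The helper name `top_level_comments` is hypothetical and is not taken from the study's code.

```python
def top_level_comments(submission, limit=10):
    """Return the text of up to `limit` top-level comments of a thread.

    `submission` is assumed to behave like a praw Submission object.
    """
    # Assumption: "top 10" means Reddit's "top" sort; set before loading comments.
    submission.comment_sort = "top"
    # Drop the "load more comments" placeholders so only real comments remain.
    submission.comments.replace_more(limit=0)
    # Iterating the comment forest yields top-level comments only.
    top_level = list(submission.comments)
    return [comment.body for comment in top_level[:limit]]
```

Because the list is sliced rather than indexed, a thread with fewer than 10 comments simply returns everything it has.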

A sample of raw data from praw

Data Preprocessing

Before building the corpus for analysis, we preprocessed the text by applying the following filters:

  • Removing URLs
  • Removing Emails
  • Removing Non-Alphanumeric Characters

Remove Stopwords

To avoid dilution from irrelevant English words, we used the nltk stopwords list plus the additional set of stopwords below. We removed all of these stopwords from our dataset before the analysis.

["wanted", "really", "learn", "language", "japanese", "want", "reason", "why", "interest", "learning", "year", "started", "english", "English", "love", "now", "reddit", "www", "http", "https", "learnjapanese", "get", "got", "also", "like", "since", "though", "comment", "com", "time", "know", "motivation", "hope", "year", "study", "people", "would", "think", "thing", "never", "could", "studying", "one", "day", "ago", "new", "motivate", "language", "something", "interested", "awesome", "good"]
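As a rough sketch of how such a filter could be applied, the snippet below combines a base list with custom stopwords and drops matching tokens. To stay self-contained it uses a tiny stand-in for the nltk list and only the first few custom entries; the names `BASE_STOPWORDS`, `CUSTOM_STOPWORDS`, and `remove_stopwords` are hypothetical, and the case-insensitive comparison is an assumption.

```python
# Stand-in for nltk.corpus.stopwords.words("english"); only a few entries shown.
BASE_STOPWORDS = {"i", "to", "the", "a", "and", "in", "of", "is", "it", "so"}

# First few entries of the custom list above.
CUSTOM_STOPWORDS = {"wanted", "really", "learn", "language", "japanese",
                    "want", "reason", "why"}

STOPWORDS = BASE_STOPWORDS | CUSTOM_STOPWORDS

def remove_stopwords(tokens):
    """Keep only tokens that are not stopwords (assumed case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]
```

For example, `remove_stopwords("I really want to learn Japanese to watch anime".split())` keeps only `watch` and `anime`.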


Lemmatization considers the context of a word and converts it to its standardized base form (Stanford NLP Group, n.d.). This keeps the focus on the meaning of the words and prevents a word's frequency from being split across its multiple variations.


  • am, are, is => be
  • car, cars, car's, cars' => car

Our code for lemmatization
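As a toy illustration of the mappings above (this is not the code used in the study), a lookup table can stand in for a real lemmatizer. A library lemmatizer such as nltk's `WordNetLemmatizer` would normally replace the hand-written table; `LEMMA_TABLE` and `lemmatize` are hypothetical names.

```python
from collections import Counter

# Hand-written table covering only the two examples above.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def lemmatize(token):
    """Return the base form from the table, or the lowercased token if unknown."""
    token = token.lower()
    return LEMMA_TABLE.get(token, token)

# Variants collapse into a single entry, so their counts are pooled.
counts = Counter(lemmatize(t) for t in ["car", "cars", "car's", "is", "are"])
```

Without this step, `car`, `cars`, and `car's` would each be counted once; after it, `car` is counted three times.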


List of Keywords

We used the following lists of keywords to determine whether a comment cites Japanese media and entertainment, or one of the other categories, as a learning motivation:

  • To enjoy Japanese media: [anime, manga, game, novel, otaku, weaboo, weeb, music, media]
  • For travel in Japan: [travel, culture, food]
  • To live in Japan: [work, live, dating, marry]
  • To communicate with loved ones: [relationship, friends, wife, girlfriend, husband, boyfriend, mother, father, parents, grandparents, grandfather, grandmother, relatives]
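One way to apply these lists is to tag each comment with every category whose keywords it contains, then count comments per category. The sketch below assumes comments are already tokenized and matches on exact tokens; the category labels and function names are hypothetical.

```python
from collections import Counter

# Keyword lists copied from the categories above; the short labels are assumed.
CATEGORY_KEYWORDS = {
    "media": {"anime", "manga", "game", "novel", "otaku", "weaboo", "weeb",
              "music", "media"},
    "travel": {"travel", "culture", "food"},
    "live": {"work", "live", "dating", "marry"},
    "communicate": {"relationship", "friends", "wife", "girlfriend", "husband",
                    "boyfriend", "mother", "father", "parents", "grandparents",
                    "grandfather", "grandmother", "relatives"},
}

def categorize(tokens):
    """Return the set of categories with at least one keyword in the comment."""
    token_set = set(tokens)
    return {name for name, kws in CATEGORY_KEYWORDS.items() if token_set & kws}

def count_categories(comments):
    """Count how many tokenized comments fall into each category."""
    counts = Counter()
    for tokens in comments:
        counts.update(categorize(tokens))
    return counts
```

A comment can fall into more than one category under this scheme, so the per-category counts need not sum to the number of comments.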


An N-gram is a contiguous sequence of N words. We wanted to investigate the most popular word combinations to capture further insights into the motivations for learning Japanese.

Example of N-Grams:

  1. San Francisco (is a 2-gram/bigram)
  2. The Three Musketeers (is a 3-gram/trigram)
  3. She stood up slowly (is a 4-gram)
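N-gram extraction and counting can be sketched in a few lines of standard-library Python. The function names and the default `k=30` (chosen to match the top-30 charts) are assumptions, and the input is taken to be a list of already tokenized comments.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(comments, n, k=30):
    """Count n-grams across all tokenized comments and return the k most common."""
    counts = Counter()
    for tokens in comments:
        # Count per comment so n-grams never span two different comments.
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)
```

For instance, `ngrams(["she", "stood", "up", "slowly"], 2)` yields the three bigrams `("she", "stood")`, `("stood", "up")`, and `("up", "slowly")`.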


Each comment needs to contain at least one category keyword
Number of times top 30 Uni-grams show up in the corpus
Number of times top 30 Bi-grams show up in the corpus
Number of times top 30 Tri-grams show up in the corpus


Of the 432 comments in the Reddit threads, 204 contained at least one of the keywords related to Japanese media and entertainment. This is more than double the count for each of the other common reasons: studying Japanese to travel in Japan, to live in Japan, or to communicate with loved ones. The most common word combinations in the bi-gram and tri-gram bar charts further support this high frequency. In conclusion, Japanese media such as anime can be identified as a significant reason why people learn Japanese. Further work could explore whether similar motivations exist among learners of other languages.


Andybywire. (2020). NLP Text Analysis [Jupyter notebook]. GitHub. https://github.com/andybywire/nlp-text-analysis/blob/master/text-analytics.ipynb

Stanford NLP Group. (n.d.). Stemming and lemmatization. Stanford NLP Group. Retrieved September 3, 2022, from https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

50 Threads from Reddit: https://drive.google.com/file/d/1CLjYtEHWgcgJRq889YukrL-S1fDrtqa0/view?usp=sharing