Publication Date

2018

Document Type

Dissertation

Committee Members

Amit Sheth (Advisor); Derek Doran (Committee Member); Krishnaprasad Thirunarayan (Committee Member); Wenbo Wang (Committee Member)

Degree Name

Doctor of Philosophy (PhD)

Abstract

Pictographs, commonly referred to as ‘emoji’, have become a popular way to enhance electronic communications. They are an important component of the language used in social media. Since their introduction in the late 1990s, emoji have been widely used to enhance the sentiment, emotion, and sarcasm expressed in social media messages. They are equally popular across many social media sites including Facebook, Instagram, and Twitter. In 2015, Instagram reported that nearly half of the photo comments posted on Instagram contained emoji, and in the same year, Twitter reported that the ‘face with tears of joy’ emoji had been tweeted 6.6 billion times. As of 2017, Facebook and Facebook Messenger processed over 60 million and 6 billion messages with emoji per day, respectively. Emogi, an Internet marketing firm, reports that over 92% of all online users have used emoji at least once. Creators of the SwiftKey Keyboard for mobile devices report that they process 6 billion messages per day that contain emoji. Moreover, business organizations have adopted and now accept the use of emoji in professional communication. For example, Appboy, an Internet marketing company, reports a 777% year-over-year increase and a 20% month-over-month increase in emoji usage for marketing campaigns by business organizations in 2016. These statistics leave little doubt that emoji are a significant and important aspect of electronic communication across the world. The ability to automatically process and interpret text fused with emoji will be essential as society embraces emoji as a standard form of online communication. In the same way that natural language is processed with sophisticated machine learning techniques and technologies for many important applications, including text similarity and word sense disambiguation, emoji should also be amenable to such analysis.
Yet the pictorial nature of emoji, the fact that the same emoji may be used in different contexts to express different meanings, and the fact that emoji are used in different cultures around the world, which can interpret them differently, make it especially difficult to apply traditional Natural Language Processing (NLP) techniques to analyze them. Indeed, emoji were developed organically with no overt/explicit semantics assigned to them. This contributed to their flexible usage but also led to ambiguity. Thus, similar to words, emoji can take on different meanings depending on context and part-of-speech (POS). Polysemy in emoji complicates the determination of emoji similarity and emoji sense disambiguation. However, having access to machine-readable sense repositories that are specifically designed to capture emoji meaning can play a vital role in representing, contextually disambiguating, and converting pictorial forms of emoji into text, thereby leveraging and generalizing NLP techniques for processing a richer medium of communication. This dissertation presents the creation of EmojiNet, the largest machine-readable emoji sense inventory that links Unicode emoji representations to their English meanings extracted from the Web. EmojiNet consists of (i) 12,904 sense labels over 2,389 emoji, which were extracted from reliable online web sources and linked to machine-readable sense definitions seen in BabelNet; (ii) context words associated with each emoji sense, which are inferred through word embedding models trained over Google News and Twitter message corpora for each emoji sense definition; and (iii) recognized discrepancies in the presentation of emoji on different platforms, together with a specification of the most likely platform-based emoji sense for a selected set of emoji. It then discusses how emoji meanings extracted from EmojiNet can be applied to novel downstream tasks, including emoji similarity and emoji sense disambiguation.
To address the problem of emoji similarity, it first presents a comprehensive analysis of the semantic similarity of emoji through emoji embedding models learned over emoji meanings in EmojiNet. Using emoji descriptions, emoji sense labels, and emoji sense definitions, and with different training corpora obtained from Twitter and Google News, multiple embedding models are learned to measure emoji similarity. Using a benchmark sentiment analysis dataset, it further shows that incorporating emoji meanings in EmojiNet into embedding models can improve the accuracy of sentiment analysis tasks by ~9%. To address the problem of emoji sense disambiguation, it uses word embedding models learned over Twitter and Google News corpora and shows that word embedding models can be used to improve the accuracy of emoji sense disambiguation tasks. The EmojiNet framework, its RESTful web services, and other benchmarking datasets created as part of this dissertation are publicly released at http://emojinet.knoesis.org/.

Page Count

111

Department or Program

Computer Science and Engineering PhD

Year Degree Awarded

2018

