Utilisateur:Nathalie.Bcht/D2SN 2020/Mémoire

Session of 20/01/20:

JD: extract at least 20,000 tweets mentioning one of the hashtags related to the RATP strike, then create a second dataset containing only the emoji.

QS: What does the intensive and growing use of emoji say about our relationship to writing/reading and to social communication? How do communities appropriate, or even distort, the initial meaning of emojis to serve their own interests? Do we communicate better or worse, or is it just different?

MA: application of digital methods from data science (scraping, Cortext) and of a semiology (sociolinguistics), and anthropological analysis of a "digital cryptographic language".


Analysis of thousands of tweets focusing on the use and occurrences of emoji:

Proposal for a statistical and semiological approach to reading the growing refinement of this digital pictographic language and the problems it poses for our contemporary society.

Introduction to the subject

What can we observe by studying the recurrence and popularity of emoji in a large corpus of tweets?

According to experts, emoji form an ever-growing language, used on all devices and all digital platforms, and beyond. Emojis also seem to have generated a culture of their own, substantial enough to have launched merchandising (emoji.com) and even a movie (The Emoji Movie, 2017), thereby also becoming a whole new brand.

Like any form of culture and language, it has been appropriated by communities and shifted from its initial purpose, especially when it comes to politics, crime and sex.

The eggplant emoji and the peach emoji have become symbols for male genitals and female buttocks respectively. In People v. Jamerson (2019), a case involving sex trafficking, the pimp persistently used the crown emoji as a symbol of allegiance from his partners[1]. Finally, Catalan secessionists became visible and vocal on Twitter using the yellow ribbon emoji, having picked this color for their movement in opposition to the red of the unionists. Twitter became an extension of what was already going on in the region (people tying yellow or red ribbons in the streets and on their balconies as a sign of support): independence supporters used the yellow ribbon 🎗 in their tweets (initially a cancer-support symbol) and the opposition used the Spanish flag 🇪🇸[2].

What are emojis? The word, written “emoji” or “emojis” in the plural (with or without a capital letter), comes from the Japanese “e”, meaning image, and “moji”, meaning letter or character. They can be considered descendants of the Miscellaneous Symbols Unicode block (U+2600–U+26FF), which contains characters such as ☺ and ☎. Emojis first appeared on Japanese mobile phones in the late 1990s and spread worldwide through text messaging on iOS and Android phones; they are now used on all social media and platforms, for every type of conversation. They are colorful pictographic representations of various subjects (faces, hand gestures, hearts, flags, food, outfits, animals, weather conditions, sports, etc.) and we now count over 1,800 of them.
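To make the link between emojis and Unicode concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The helper names `describe` and `in_misc_symbols` are my own, chosen for illustration. It shows that the yellow ribbon is a single code point outside the U+2600–U+26FF block, whereas a flag such as the Spanish one is encoded as a pair of "regional indicator" characters.

```python
import unicodedata

# The Miscellaneous Symbols block, U+2600 to U+26FF inclusive.
MISC_SYMBOLS = range(0x2600, 0x2700)


def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def in_misc_symbols(ch):
    """True if the single character ch belongs to Miscellaneous Symbols."""
    return ord(ch) in MISC_SYMBOLS


if __name__ == "__main__":
    for sample in ("\u263A", "\U0001F397", "\U0001F1EA\U0001F1F8"):
        print(sample, describe(sample))
```

One practical consequence for the analysis: counting "characters" is not the same as counting emojis, since one visible emoji can span several code points.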

We would therefore like to ask: what does the intensive and growing use of emoji say about our contemporary relation to literacy and social communication? How do communities appropriate, and even distort, the initial meaning of emojis to serve their own interests? Do we communicate better or worse, or is it just different?

To answer these large-scale sociological questions, we need to conduct data-driven research that borrows methods from data science.

Methodology:

Data gathering

Before answering any social question, we need to gather data and apply text mining and other methods inspired by data science. Twitter is a handy and relevant social medium for analyzing text, for three reasons: it is text-centered (compared to Instagram, Pinterest or Flickr, which work as galleries and leave little room for writing); it caps the number of characters per post, which limits the complexity and length of text-mining processes; and it provides fairly easy access to an API for collecting tweets.

The protocol would be either 1) to obtain a public dataset online, or a dataset that colleagues would agree to share with me, or 2) to scrape the tweets and gather the data myself.

The dataset will have to be as large as possible (no fewer than 20,000 tweets), and either:

-         be representative of a variety of subjects and uses on Twitter (political tweets, comic tweets, media tweets, everyday tweets), or

-         focus on a particular event, to see how emojis have been used in a specific context and whether they have been twisted, appropriated, etc.

If the latter option is selected, the tweet corpus would be built around the French strike against the pension reform, which started across the country on 5 December 2019. To select the related tweets, we would use the following hashtags (chosen after a close exploration of the tweets around the strike):

#greveratp #grèveRATP #RATP #RER #metro #métro #greveSNCF #retraites #reformedesretraites #ReformeRetraite #bus #tram #GreveGenerale

It would cover the period from Thursday 5 December to 20 December 2019, and could be extended if the movement lasts.
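The selection rule described here (one of the listed hashtags, within the chosen dates) could be sketched as follows. This is only an illustration under my own assumptions: the function name `is_relevant` is hypothetical, hashtags are compared case-insensitively, and both end dates are treated as inclusive. How the tweets are actually obtained is left out.

```python
import re
from datetime import date

# Hashtags listed above, normalised for case-insensitive comparison.
# Accented and unaccented spellings stay distinct, as in the original list.
STRIKE_HASHTAGS = {
    tag.casefold()
    for tag in (
        "#greveratp", "#grèveRATP", "#RATP", "#RER", "#metro", "#métro",
        "#greveSNCF", "#retraites", "#reformedesretraites",
        "#ReformeRetraite", "#bus", "#tram", "#GreveGenerale",
    )
}
START, END = date(2019, 12, 5), date(2019, 12, 20)  # assumed inclusive
HASHTAG_RE = re.compile(r"#\w+")


def is_relevant(text, posted_on):
    """Keep a tweet if it falls in the window and carries a listed hashtag."""
    if not START <= posted_on <= END:
        return False
    tags = {tag.casefold() for tag in HASHTAG_RE.findall(text)}
    return not tags.isdisjoint(STRIKE_HASHTAGS)


if __name__ == "__main__":
    print(is_relevant("Encore bloqué ce matin #GreveRATP", date(2019, 12, 6)))
    print(is_relevant("Trafic normal #RATP", date(2019, 12, 21)))
```

If the movement lasts and the period is extended, only `END` would need to change.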


The dataset needs to provide the following information: the full text of each tweet, the account name, the date it was posted, the number of likes, the number of comments, and the number of retweets.
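As a rough sketch of what one row of such a dataset could look like, the six fields above can be held in a small record type and written to CSV with the standard library. The class name `TweetRecord`, the field names and the `engagement` sum are my own assumptions, not a format imposed by Twitter or by any tool.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime


@dataclass
class TweetRecord:
    """One tweet with the six fields the dataset needs (names are assumed)."""
    full_text: str
    account_name: str
    posted_at: datetime
    likes: int
    comments: int
    retweets: int

    @property
    def engagement(self):
        # Simple total; weighting the three signals differently is possible.
        return self.likes + self.comments + self.retweets


def write_csv(records, handle):
    """Write records to an open text handle, one row per tweet."""
    names = [field.name for field in fields(TweetRecord)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["posted_at"] = record.posted_at.isoformat()
        writer.writerow(row)


if __name__ == "__main__":
    buffer = io.StringIO()
    sample = TweetRecord("Grève #RATP", "@user", datetime(2019, 12, 5, 8, 30), 3, 1, 2)
    write_csv([sample], buffer)
    print(buffer.getvalue())
```

A plain CSV of this kind can be opened in either Python or R.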

Data processing

Once the dataset is gathered, we would apply text mining to it. The main tasks are to:

-         Display the most recurrent emojis

-         Display the terms with which the emojis are most often associated

-         Determine whether there is any correlation between the use of emojis in tweets and engagement (likes, comments, retweets)

-         Identify which types of accounts use emojis the most

-         Determine whether the use of emojis is related to particular emotions
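The first three tasks could be sketched as below, using only the standard library. Several things here are my own assumptions: the regular expression `EMOJI_RE` is a rough approximation based on a few Unicode ranges (it splits skin-tone and joined sequences into separate symbols, and a dedicated emoji library would be more thorough), the function names are hypothetical, and a Pearson coefficient between the number of emojis per tweet and an engagement score is only one possible way to look for a correlation.

```python
import math
import re
from collections import Counter, defaultdict

# Rough emoji matcher (an approximation, not a full definition of "emoji"):
# a flag is a pair of regional-indicator characters; other emojis are taken
# from the main pictographic ranges, optionally followed by U+FE0F.
EMOJI_RE = re.compile(
    "[\U0001F1E6-\U0001F1FF]{2}"
    "|[\U0001F300-\U0001FAFF\u2600-\u27BF]\uFE0F?"
)
WORD_RE = re.compile(r"#?\w+")


def extract_emojis(text):
    """Return the emojis found in a tweet, in order of appearance."""
    return EMOJI_RE.findall(text)


def most_common_emojis(texts, n=10):
    """Return the n most frequent emojis with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(extract_emojis(text))
    return counts.most_common(n)


def cooccurring_terms(texts):
    """Map each emoji to a Counter of the words sharing a tweet with it."""
    table = defaultdict(Counter)
    for text in texts:
        words = [w.lower() for w in WORD_RE.findall(EMOJI_RE.sub(" ", text))]
        for emoji_char in set(extract_emojis(text)):
            table[emoji_char].update(words)
    return table


def pearson(xs, ys):
    """Pearson correlation coefficient; None if either series is constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    tweets = ["Grève 🎗🎗 #RATP", "Bloqué 😡 encore 🎗", "🇪🇸 vs 🎗"]
    print(most_common_emojis(tweets, 3))
    print(dict(cooccurring_terms(tweets)))
    emoji_counts = [len(extract_emojis(t)) for t in tweets]
    print(pearson(emoji_counts, [12, 40, 7]))  # made-up engagement scores
```

The lists returned by `extract_emojis` are also what the second, emoji-only dataset would contain. The last two tasks (types of accounts, emotions) need extra annotation of the data and are not covered by this sketch.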

We will then have to apply the usual text-mining steps:

-         Tokenize the tweets into words and sentences.

-         Remove the most frequent non-significant terms (determiners, pronouns, common verbs, etc.).

-         Try word and sentence embeddings.
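The first two steps can be illustrated with a minimal standard-library sketch. It makes simplifying assumptions: the stop-word set is a tiny hand-written French sample (a real study would use a fuller list), sentences are split naively on `.`, `!` and `?`, and the function names are my own. Embeddings are left out, since they depend on the library or pre-trained model eventually chosen.

```python
import re

# Tiny illustrative sample of French stop words, not a complete list.
STOPWORDS = {
    "le", "la", "les", "de", "des", "du", "un", "une", "et", "à", "en",
    "je", "il", "elle", "est", "pas", "que",
}
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")  # whitespace after ., ! or ?
TOKEN_RE = re.compile(r"[#@]?\w+")          # words, #hashtags, @mentions


def split_sentences(text):
    """Naive sentence splitter based on end-of-sentence punctuation."""
    return [s for s in SENTENCE_RE.split(text.strip()) if s]


def tokenize(text):
    """Lower-cased word tokens; hashtags and mentions keep their prefix."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def content_tokens(text):
    """Tokens left once the stop words are removed."""
    return [token for token in tokenize(text) if token not in STOPWORDS]


if __name__ == "__main__":
    tweet = "La grève continue. Pas de métro ! #RATP"
    print(split_sentences(tweet))
    print(content_tokens(tweet))
```

Keeping the `#` prefix on hashtags is a design choice: it lets a later analysis distinguish a word used as a tag from the same word used in running text.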


I do not know yet whether I will use a Python script or an R script. I will surely need help to gather and analyze the data, as I am not yet fully comfortable with data science tools.

I am thinking about a partnership with a student from ESIEE, if they are interested in the project and if our schedules are compatible.

Relevance of the subject

While doing my research on emojis, I found out that their use has attracted researchers from different fields (linguistics, cognitive sciences, law, politics) but very rarely sociologists, and even more rarely data scientists. It therefore seems to me that the subject could well be treated by a data scientist with a sociological eye. In addition, studying emoji will require a great deal of research on my part, including understanding how Unicode blocks work.

Even though the subject of emoji could at first be seen as a merely entertaining news item, I have the strong feeling that this digital language, and what individuals have made of it, is representative of many ongoing human phenomena, especially those related to the digital shift. It is relatively easy to observe that communication evolves through time, as do human relationships; studying the role emoji have played in those areas can give us clues as to why they have changed, and let us speculate on how they will continue to evolve.

Studying the growing use of emojis on different social media also (re)launches the great debates around technophobia, and around the idea that the simplified designs and structures of technologies and social media would harm human IQ and people's abilities to write and to socialize.

I believe that studying the use of emojis through data science and sociology is part of a greater project within digital sociology/anthropology and computational sciences: learning more about the effects of technologies on societies and on their evolution to come, with a fairly strong intention of showing that there is nothing to fear and that minds simply need to be educated.



LINK TO MY ZOTERO: https://www.zotero.org/groups/2408085/master_d2sn_2020/items/collectionKey/QA4MUQLP

  1. Eric Goldman, « Two Examples of How Courts Interpret Emojis », in Emojis and Evidence/Discovery, on Technology & Marketing Law Blog (accessed 17 December 2019)
  2. « Exploring the Emoji Divide in Catalonia », on danielbalcells.github.io (accessed 17 December 2019)