How to effectively clean social media data for analysis

[box type="note" align="" class="" width=""]This article is a book extract from Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk.[/box]

Data cleaning and preprocessing is an essential - and often crucial - part of any analytical process. In this excerpt, we explain the different techniques and mechanisms for effective analysis of your social media data.

Social media contains different types of data: information about user profiles, statistics (number of likes or number of followers), verbatims, and other media content. Quantitative data is very convenient for an analysis using statistical and numerical methods, but unstructured data such as user comments is much more challenging. To get meaningful information, one has to perform the whole process of information retrieval. It starts with the definition of the data type and data structure. On social media, unstructured data is related to text, images, videos, and sound, and we will mostly deal with textual data. Then, the data has to be cleaned and normalized. Only after all these steps can we delve into the analysis.

Social media data type and encoding

Comments and conversations are textual data that we retrieve as strings. In brief, a string is a sequence of characters represented by code points. In Python 3, every string is a sequence of Unicode code points, which range from 0 through 0x10FFFF (1,114,111 decimal). To be stored in memory or sent over a network, that sequence has to be represented as a sequence of bytes (values from 0 to 255). The rules for translating a Unicode string into a sequence of bytes are called an encoding.
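As a minimal illustration of the difference between code points and bytes, the sketch below inspects a short string. The sample text is our own choice (ASCII letters, one accented letter, and one emoji) and does not come from any dataset.

```python
# Sample text written with escapes so the example does not depend on the
# encoding of this file: "café" followed by a space and a smiley emoji.
text = "caf\u00e9 \U0001F600"

# Each character is one Unicode code point (an integer from 0 to 0x10FFFF).
code_points = [ord(ch) for ch in text]

# Encoding turns the string into bytes; decoding reverses the process.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

print(code_points)
print(encoded)
print(len(text), len(encoded))  # number of characters vs. number of bytes
```

The accented letter takes two bytes in UTF-8 and the emoji takes four, which is why the byte length is larger than the character count.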

Encoding plays a very important role in natural language processing because people use more and more characters such as emojis or emoticons, which replace whole words and express emotions. Moreover, many languages use accented characters that go beyond the regular English alphabet. To avoid the processing problems these characters can cause, we have to use the right encoding, because comparing two strings with different encodings is like comparing apples and oranges. The most common encoding is UTF-8, the default in Python 3, which can represent any Unicode character. As a rule of thumb, always normalize your data to UTF-8.
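The following sketch shows what goes wrong when bytes are decoded with the wrong encoding. The helper `to_text` is a hypothetical name introduced for illustration; it assumes incoming data is either `bytes` or `str`.

```python
# Bytes as they might arrive from an API: "café" encoded as UTF-8.
raw = "caf\u00e9".encode("utf-8")

right = raw.decode("utf-8")    # the original text
wrong = raw.decode("latin-1")  # garbled: the two UTF-8 bytes become two characters

print(right == "caf\u00e9")
print(wrong == "caf\u00e9")


def to_text(data, encoding="utf-8"):
    """Hypothetical helper: return str, decoding bytes if necessary.

    Undecodable bytes are replaced rather than raising an exception.
    """
    if isinstance(data, bytes):
        return data.decode(encoding, errors="replace")
    return data
```

Applying one such helper at every input point is one way to make sure all strings are compared on equal terms.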

Structure of social media data

Another question we'll encounter is: what is the right structure for our data? The most natural choice is a list that can store a sequence of data points (verbatims, numbers, and so on). However, lists are not efficient on large datasets and constrain us to sequential processing of the data. That is why a much better solution is to store the data in a tabular format in a pandas DataFrame, which has multiple advantages for further processing. First of all, rows are indexed, so search operations become much faster. There are also many optimized methods for different kinds of processing, and above all it allows you to optimize your own processing by using functional programming. Moreover, a row can contain multiple fields with metadata about verbatims, which are very often used in our analysis.
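A small sketch of this layout follows. The column names and sample rows are invented for illustration; only the pandas calls themselves are standard.

```python
import pandas as pd

# Hypothetical sample: each row is one verbatim plus its metadata.
df = pd.DataFrame(
    {
        "verbatim": ["  Great product!  ", "Not what I expected", "Go go GO"],
        "likes": [12, 3, 0],
        "author": ["user_a", "user_b", "user_c"],
    }
)

# Vectorized string methods process the whole column without an explicit loop.
df["clean"] = df["verbatim"].str.strip().str.lower()

# A custom function can be applied to every value of a column in the same way.
df["n_words"] = df["clean"].apply(lambda text: len(text.split()))

# Metadata columns make filtering straightforward.
popular = df[df["likes"] > 5]

print(df[["clean", "n_words"]])
print(popular["author"].tolist())
```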

It is worth remembering that a dataset in pandas must fit into RAM. For bigger datasets, we suggest the use of SFrames.

Pre-processing and text normalization

Preprocessing is one of the most important parts of the analysis process. It reformats the unstructured data into a uniform, standardized form. The characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages. The quality of the preprocessing has a big impact on the final result of the whole process.

There are several stages of the process: from simple text cleaning by removing white spaces, punctuation, HTML tags, and special characters up to more sophisticated normalization techniques such as tokenization, stemming, or lemmatization. In general, the main aim is to keep all the characters and words that are important for the analysis and, at the same time, get rid of all others, while maintaining the text corpus in one uniform format.

We import all necessary libraries.

import re
import itertools

import nltk
from nltk.corpus import stopwords

When dealing with raw text, we usually have a set of words including many details we are not interested in, such as whitespace, line breaks, and blank lines. Moreover, many words contain capital letters, so a program treats, for example, "go" and "Go" as two different words. To handle such details, we clean the text and convert all words to lowercase with the following steps:

1. Perform basic text mining cleaning.
2. Remove leading and trailing whitespace:

verbatim = verbatim.strip()

Many text processing tasks can be done via pattern matching. We can find words containing a character and replace it with another one or just remove it. Regular expressions give us a powerful and flexible method for describing the character patterns we are interested in. They are commonly used for cleaning punctuation, HTML tags, and URL paths.

3. Remove punctuation:

verbatim = re.sub(r'[^\w\s]', '', verbatim)

4. Remove HTML tags:

verbatim = re.sub(r'<[^<]+?>', '', verbatim)

5. Remove URLs:

verbatim = re.sub(r'^https?://.*[\r\n]*', '', verbatim, flags=re.MULTILINE)
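The three regular-expression steps can be combined into one function, as in the sketch below. Two choices here are our assumptions rather than part of the original steps: tags and URLs are removed before punctuation (stripping punctuation first would delete the `<`, `>`, `:` and `/` characters the other patterns rely on), and the URL pattern `https?://\S+` matches a URL anywhere in the text rather than only at the start of a line. The function name `regex_clean` is hypothetical.

```python
import re


def regex_clean(verbatim):
    """Strip HTML tags, URLs, and punctuation, in that order."""
    verbatim = re.sub(r"<[^<]+?>", "", verbatim)      # HTML tags
    verbatim = re.sub(r"https?://\S+", "", verbatim)  # URLs anywhere in the text
    verbatim = re.sub(r"[^\w\s]", "", verbatim)       # punctuation
    return re.sub(r"\s+", " ", verbatim).strip()      # collapse leftover spaces


sample = "<p>Loved it!!! See https://example.com/review?id=1 for more.</p>"
print(regex_clean(sample))
```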

Depending on the quality of the text corpus, there is sometimes a need to implement some corrections. This applies to text sources such as Twitter or forums, where emotions play a role and comments contain words with repeated letters, for example "happpppy" instead of "happy".

6. Standardize words (remove multiple letters):

verbatim = ''.join(''.join(s)[:2] for _, s in itertools.groupby(verbatim))
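To see what this step does, the sketch below wraps the same `itertools.groupby` idea in a function. The name `squeeze_repeats` and the `max_run` parameter are ours, added for illustration.

```python
import itertools


def squeeze_repeats(verbatim, max_run=2):
    """Shorten every run of identical characters to at most max_run characters."""
    return "".join(
        "".join(group)[:max_run] for _, group in itertools.groupby(verbatim)
    )


print(squeeze_repeats("happpppy"))
print(squeeze_repeats("soooo goooood"))
print(squeeze_repeats("1000"))
```

Note that the rule is purely mechanical: "soooo" becomes "soo" rather than "so", and digits are affected too, so "1000" turns into "100". Depending on the corpus, it may be safer to apply the step to alphabetic tokens only.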

After removal of punctuation or white spaces, words can be attached. This happens especially when deleting the periods at the end of the sentences. The corpus might look like: "the brown dog is lostEverybody is looking for him". So there is a need to split "lostEverybody" into two separate words.

7. Split attached words:

verbatim = re.sub(r'([a-z])([A-Z])', r'\1 \2', verbatim)
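One way to split attached words without losing any text that precedes the first capital letter is to insert a space at every lowercase-to-uppercase boundary. This is a sketch under that assumption; the function name `split_attached` is hypothetical.

```python
import re


def split_attached(verbatim):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", verbatim)


text = "the brown dog is lostEverybody is looking for him"
print(split_attached(text))
print(split_attached("my iPhone plays YouTube"))
```

The second call shows the limitation: brand names written in mixed case, such as "iPhone" or "YouTube", are split as well, so a list of protected terms may be needed for some corpora.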

Stop words are basically a set of commonly used words in any language: mainly determiners, prepositions, and coordinating conjunctions. By removing the words that are very commonly used in a given language, we can focus only on the important words instead, and improve the accuracy of the text processing.

8. Convert text to lowercase, lower():

verbatim = verbatim.lower()

9. Stop word removal:

stop_words = set(stopwords.words('english'))  # build the set once, not once per word
verbatim = ' '.join(word for word in verbatim.split() if word not in stop_words)
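The effect of stop word removal can be tried without downloading the NLTK corpus by using a small hand-picked set, as in this sketch. The set `STOP_WORDS` and the function `remove_stop_words` are illustrative assumptions; a real analysis would use the full NLTK list.

```python
# Tiny hand-picked stop word set, used only so the example runs without
# the NLTK corpus; it is far smaller than the real English list.
STOP_WORDS = {"the", "is", "for", "a", "an", "and", "of", "to", "in"}


def remove_stop_words(verbatim, stop_words=STOP_WORDS):
    """Drop every whitespace-separated word that appears in stop_words."""
    return " ".join(word for word in verbatim.split() if word not in stop_words)


text = "the brown dog is lost everybody is looking for him"
print(remove_stop_words(text))
```

Because the comparison is case-sensitive, the text should be lowercased first, which is why the lowercase step comes before this one.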

10. Stemming and lemmatization: The main aim of stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Stemming reduces word forms to so-called stems, whereas lemmatization reduces word forms to linguistically valid lemmas.

Some examples are cars -> car, which a stemmer can produce by stripping the suffix, and men -> man and went -> go, which require lemmatization.
Such text processing can give added value in some domains, and may improve the accuracy of practical information extraction tasks.
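As a brief sketch of stemming, the snippet below uses the Porter stemmer shipped with NLTK, which is rule-based and needs no corpus download. The word list is our own. Lemmatization with NLTK's WordNet lemmatizer is not shown because it requires the WordNet data to be downloaded first.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Regular inflections are reduced by suffix stripping; irregular forms are not.
words = ["cars", "running", "men", "went"]
stems = {word: stemmer.stem(word) for word in words}

print(stems)
```

The irregular forms "men" and "went" are left unchanged by the stemmer, which illustrates why a lemmatizer is needed to map them to "man" and "go".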

11. Tokenization: Tokenization is the process of breaking a text corpus up into words (most commonly), phrases, or other meaningful elements, which are then called tokens. The tokens become the basic units for further text processing.

tokens = nltk.word_tokenize(verbatim)
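`nltk.word_tokenize` relies on tokenizer data that has to be downloaded separately. To show the idea without that dependency, here is a much simpler regular-expression tokenizer. It is a stand-in of our own, not NLTK's algorithm, and the name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(verbatim):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", verbatim)


print(simple_tokenize("Where is the dog? It's lost."))
```

Unlike a linguistically informed tokenizer, this pattern splits the contraction "It's" into three separate tokens, which may or may not be acceptable for a given analysis.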

Other techniques are spelling correction, domain knowledge, and grammar checking.

Duplicate removal

Depending on the data source, we might notice multiple duplicates in our dataset. The decision to remove duplicates should be based on an understanding of the domain. In most cases, duplicates come from errors in the data collection process, and it is recommended to remove them in order to reduce bias in our analysis, with the help of the following:

df = df.drop_duplicates(subset=['column_name'])
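The sketch below applies `drop_duplicates` to a small invented table and also shows one possible way to catch duplicates that differ only in case or surrounding whitespace. The column names and rows are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample with one exact duplicate and one case-only duplicate.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "verbatim": ["great phone", "great phone", "too expensive", "Great phone"],
    }
)

# Exact duplicates in the chosen column are dropped; the first occurrence is kept.
deduped = df.drop_duplicates(subset=["verbatim"])

# Comparing on a normalized key also catches rows that differ only in case.
key = df["verbatim"].str.strip().str.lower()
deduped_normalized = df[~key.duplicated()]

print(deduped["id"].tolist())
print(deduped_normalized["id"].tolist())
```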

Knowing basic text cleaning techniques, we can now learn how to store the data in an efficient way. For this purpose, we will explain how to use one of the most convenient NoSQL databases: MongoDB.

Capture: Once you have made a connection to your API, you need to make a special request and receive the data at your end. This step requires you to go through the data to be able to understand it. Often the data is received in a special format called JavaScript Object Notation (JSON). JSON was created to enable lightweight data interchange between programs. It resembles the older XML format and is built from key-value pairs.
Normalization: The data received from platforms are not in an ideal format to perform analysis. With textual data there are many different approaches to normalization. One can be stripping whitespaces surrounding verbatims, or converting all verbatims to lowercase, or changing the encoding to UTF-8. The point is that if we do not maintain a standard protocol for normalization, we will introduce many unintended errors. The goal of normalization is to transform all your data in a consistent manner that ensures a uniform standardization of your data.
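For the capture step, a JSON response can be turned into Python objects with the standard `json` module. The payload below is invented for illustration; real field names depend on the platform's API.

```python
import json

# Hypothetical API response; the field names are illustrative only.
payload = (
    '{"id": 42, "user": {"name": "user_a", "followers": 120}, '
    '"text": "  Luv this phone!  ", "likes": 7}'
)

record = json.loads(payload)  # JSON object -> Python dict

# Nested keys are reached by chaining lookups; .get() avoids a KeyError
# when an optional field is absent.
author = record["user"]["name"]
shares = record.get("shares", 0)

# A first normalization pass on the verbatim itself.
verbatim = record["text"].strip().lower()

print(author, shares, verbatim)
```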

It is recommended that you create wrapper functions for your normalization techniques, and then apply these wrappers on all your data input points so as to ensure that all the data in your analysis go through exactly the same normalization process. In general, one should always perform the following cleaning steps:

1. Normalize the textual content: Normalization generally contains at least the following steps:

Stripping surrounding whitespaces.
Lowercasing the verbatim.
Universal encoding (UTF-8).

2. Remove special characters (example: punctuation).

3. Remove stop words: Irrespective of the language, stop words add no informative value to the analysis, except in the case of deep parsing, where stop words can be bridge connectors between targeted words.

4. Splitting attached words.

5. Removal of URLs and hyperlinks: URLs and hyperlinks can be studied separately, but due to the lack of grammatical structure they are by convention removed from verbatims.

6. Slang lookups: This is a relatively difficult task, because here we would require a predefined vocabulary of slang words and their proper reference words, for example: luv maps to love. Such dictionaries are available on the open web, but there is always a risk of them being outdated.

In the case of studying words and not phrases (or n-grams), it is very important to do the following:

Tokenize verbatim
Stemming and lemmatization (Optional): This is where different written forms of the same word do not hold additional meaning to your study
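The cleaning steps listed above can be gathered into a single wrapper so that every input goes through the same process. The following is a minimal sketch, not a definitive implementation: the function name `normalize_verbatim`, the tiny `SLANG` and `STOP_WORDS` lookups, and the order of the steps are all assumptions made for illustration.

```python
import re

# Illustrative lookups only; real projects would load fuller resources.
SLANG = {"luv": "love", "gr8": "great"}
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "this"}


def normalize_verbatim(verbatim):
    """Apply the cleaning steps in one fixed order and return a list of tokens."""
    if isinstance(verbatim, bytes):                    # universal encoding (UTF-8)
        verbatim = verbatim.decode("utf-8", errors="replace")
    text = verbatim.strip()                            # surrounding whitespace
    text = re.sub(r"https?://\S+", " ", text)          # URLs and hyperlinks
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)   # attached words
    text = text.lower()                                # lowercase
    text = re.sub(r"[^\w\s]", " ", text)               # special characters
    tokens = text.split()                              # tokenize on whitespace
    tokens = [SLANG.get(tok, tok) for tok in tokens]   # slang lookup
    return [tok for tok in tokens if tok not in STOP_WORDS]


sample = "  Luv this phone!!! It is gr8, see https://example.com/p/1 battery is weakScreen is fine  "
print(normalize_verbatim(sample))
```

Attached words are split before lowercasing because the split relies on the capital letter, and punctuation is replaced by a space rather than deleted so that removing it does not glue neighbouring words together.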

Some advanced cleaning procedures are:

Grammar checking: Grammar checking is mostly learning-based: models are trained on a huge amount of proper text data and then used for grammar correction. There are many online tools available for this purpose. This is a very tricky cleaning technique because language style and structure can change from source to source (for example, language on Twitter will not correspond with the language of published books). Wrongly correcting grammar can have negative effects on the analysis.

Spelling correction: Spelling errors are common in natural language. Companies such as Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use algorithms such as Levenshtein distance, dictionary lookup, and so on, or other modules and packages to fix these errors. Again, take spell correction with a grain of salt, because false positives can affect the results.
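To make the combination of Levenshtein distance and dictionary lookup concrete, here is a small sketch. The vocabulary, the `max_distance` threshold, and the function names are assumptions chosen for the example; production spell checkers are considerably more elaborate.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical vocabulary of known-good words.
VOCABULARY = ["happy", "phone", "battery", "screen"]


def correct(word, vocabulary=VOCABULARY, max_distance=2):
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    best = min(vocabulary, key=lambda candidate: levenshtein(word, candidate))
    return best if levenshtein(word, best) <= max_distance else word


print(correct("batery"))
print(correct("phony"))
print(correct("xyzzy"))
```

The second call illustrates the false-positive risk: "phony" is a valid word, but because it is missing from the small vocabulary and lies one edit away from "phone", it gets "corrected" anyway.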

Storing: Once the data is received, normalized, and/or cleaned, we need to store the data in an efficient storage database. In this book we have chosen MongoDB as the database as it's a modern and scalable database. It's also relatively easy to use and get started with. However, other databases such as Cassandra or HBase could also be used depending on expertise and objectives.

Data cleaning and preprocessing, although tedious, can simplify your data analysis work. With effective Python packages such as NumPy, SciPy, and pandas, these tasks become much easier and save you a lot of time.

If you found this piece of information useful, make sure to check out our book Python Social Media Analytics, which will help you draw actionable insights from mining social media portals such as GitHub, Twitter, YouTube, and more!