How to store and access social media data in MongoDB


Note: The following excerpt is taken from the book Python Social Media Analytics, co-authored by Siddhartha Chatterjee and Michal Krystyanczuk.

This article explains how to use MongoDB with Python to store, access, and modify data effectively.

According to the official MongoDB page:

MongoDB is a free and open-source distributed database that allows ad hoc queries, indexing, and real-time aggregation to access and analyze your data. It is published under the GNU Affero General Public License and stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time.

Along with ease of use, MongoDB is recognized for the following advantages:

Schema-less design: Unlike traditional relational databases, which require the data to fit their schema, MongoDB provides a flexible, schema-less data model. The data model is based on documents and collections. A document is essentially a JSON structure, and a collection is a group of documents.

Data within collections is linked using specific identifiers. The document model is particularly useful for social media analysis, as most social media APIs provide their data in JSON format.

High performance: The indexing and GridFS features of MongoDB provide fast access and storage.
High availability: The replication feature, which keeps copies of a database on different nodes, ensures high availability in the case of node failures.
Automatic scaling: The sharding feature of MongoDB scales large data sets automatically.

You can access information on the implementation of Sharding in the official documentation of MongoDB: https://docs.mongodb.com/v3.0/sharding/
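As a rough illustration of the schema-less document model described above, the sketch below builds two documents with different fields that could sit in the same collection. The field names and values are invented for illustration, and a plain Python list stands in for the collection, so no running server is needed.

```python
import json

# Two hypothetical social media documents. Their fields differ, which a
# schema-less collection allows.
tweet = {"author": "user_a", "content": "Hello", "retweets": 3}
video_comment = {"author": "user_b", "content": "Nice video", "replies": ["Thanks!"]}

# A collection is a group of documents; a list stands in for it here.
collection = [tweet, video_comment]

# Each document serializes to a JSON structure.
for document in collection:
    print(json.dumps(document))

# Fields that appear in only one of the two documents.
differing_fields = sorted(set(tweet) ^ set(video_comment))
print(differing_fields)
```

Both documents share author and content, in line with the convention of keeping a common general structure, while each carries one field the other lacks.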

Installing MongoDB

MongoDB can be downloaded and installed from the following link: http://www.mongodb.org/downloads

Setting up the environment

MongoDB requires a data directory to store all the data. By default on Windows, mongod uses \data\db on the current drive, which can be created as follows:

md \data\db

Starting MongoDB

We need to go to the MongoDB installation folder, which contains the bin directory where mongod.exe is stored, and run the following command:

bin\mongod.exe

Once the MongoDB server is running in the background, we can switch to our Python environment to connect and start working.

MongoDB using Python

MongoDB can be used directly from its command-line shell or through programming languages. For the sake of our book, we'll explain how it works using Python. MongoDB is accessed from Python through a driver module named PyMongo.

We will not go into the detailed usage of MongoDB, which is beyond the scope of this book. We will see the most common functionalities required for analysis projects. We highly recommend reading the official MongoDB documentation.

PyMongo can be installed using the following command:

pip install pymongo

Then the following lines import it in the Python script and create a client connected to the local server:

from pymongo import MongoClient

client = MongoClient('localhost:27017')
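Since the MongoClient constructor returns without waiting for a connection, it can help to check that the server actually responds before going further. The following is a minimal sketch of such a check; the function name server_is_reachable is hypothetical, and the only PyMongo call it relies on is the ping command issued through client.admin.command.

```python
def server_is_reachable(client):
    """Return True if the server answers a ping, False otherwise.

    `client` is expected to be a pymongo.MongoClient (or any object
    exposing the same `admin.command` interface).
    """
    try:
        # "ping" is a lightweight server command that forces a real
        # round trip to the server.
        client.admin.command("ping")
        return True
    except Exception:
        # With PyMongo this is typically pymongo.errors.ConnectionFailure;
        # the broad except keeps this sketch free of extra imports.
        return False


# Usage with a running server (requires PyMongo):
#   from pymongo import MongoClient
#   client = MongoClient('localhost:27017')
#   print(server_is_reachable(client))
```

In a real script you would probably catch the specific PyMongo exception instead of Exception, so that unrelated bugs are not silently reported as a connection problem.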

The database structure of MongoDB is similar to that of SQL databases, where you have databases and, inside them, tables. In MongoDB you have databases, and inside them you have collections. Collections are where you store the data, and a database stores multiple collections. As MongoDB is a NoSQL database, your collections do not need a predefined structure; you can add documents of any composition as long as they are JSON objects. By convention, however, it is best practice to keep a common general structure for documents in the same collection.

To access a database named scrapper we simply have to do the following:

db_scrapper = client.scrapper

To access a collection named articles in the database scrapper we do this:

db_scrapper = client.scrapper

collection_articles = db_scrapper.articles

Once you have the client object initiated you can access all the databases and the collections very easily.
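Besides attribute access, PyMongo also supports dictionary-style access, which is handy when the database or collection name is held in a variable. The helper below is a small sketch of that idea; get_collection is a hypothetical name, not part of PyMongo.

```python
def get_collection(client, db_name, collection_name):
    """Return the collection `collection_name` from database `db_name`.

    client[db_name] is equivalent to attribute access on the client,
    and the same holds for collections on a database object.
    """
    return client[db_name][collection_name]


# Usage with a running server (requires PyMongo):
#   from pymongo import MongoClient
#   client = MongoClient('localhost:27017')
#   collection_articles = get_collection(client, "scrapper", "articles")
```

Neither the database nor the collection has to exist beforehand: MongoDB creates them lazily when the first document is inserted.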

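Now, we will see how to perform different operations. In the snippets that follow, db stands for a database object (such as db_scrapper) and collection for the name of one of its collections (such as articles):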

Insert: To insert a document into a collection we build a list of new documents to insert into the database:

docs = []
for _ in range(0, 10):
    # each document must be of the Python type dict
    docs.append({
        "author": "...",
        "content": "...",
        "comment": ["...", ... ]
    })

Inserting all the docs at once:

db.collection.insert_many(docs)

Or you can insert them one by one:

for doc in docs:
    db.collection.insert_one(doc)

You can find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/insert-documents/.
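When the list of documents is large, a middle ground between one insert_many call and many insert_one calls is to insert in slices. The sketch below assumes a PyMongo collection object is passed in; the name insert_in_chunks and the default chunk size are illustrative choices, not something prescribed by the book.

```python
def insert_in_chunks(collection, docs, chunk_size=1000):
    """Insert `docs` in slices of `chunk_size`; return the number inserted."""
    inserted = 0
    for start in range(0, len(docs), chunk_size):
        chunk = docs[start:start + chunk_size]
        # insert_many returns a result object whose inserted_ids attribute
        # lists the _id of every document that was written.
        result = collection.insert_many(chunk)
        inserted += len(result.inserted_ids)
    return inserted


# Usage (hypothetical collection object from PyMongo):
#   total = insert_in_chunks(db_scrapper.articles, docs, chunk_size=500)
```

An empty list is handled naturally, because the loop body never runs and insert_many is never called with no documents.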

Find: To fetch all documents within a collection:

# as the find function returns a cursor we will iterate over the cursor to actually fetch

# the data from the database

docs = [d for d in db.collection.find()]

To fetch all documents in batches of 100 documents:

batch_size = 100
iteration = 0
# getting the total number of documents in the collection
# (count_documents replaces count(), which was removed in PyMongo 4)
count = db.collection.count_documents({})
while iteration * batch_size < count:
    docs = [d for d in db.collection.find().skip(batch_size * iteration).limit(batch_size)]
    iteration += 1
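The batching loop above can also be wrapped in a generator, so that the caller simply iterates over batches. This is a sketch under two assumptions: the function name iter_batches is hypothetical, and the results are sorted by _id, because without an explicit sort MongoDB does not guarantee a stable order between successive skip/limit queries.

```python
def iter_batches(collection, batch_size=100):
    """Yield lists of at most `batch_size` documents until none are left."""
    skipped = 0
    while True:
        # Sort by _id (ascending) so each page continues where the
        # previous one stopped.
        cursor = collection.find().sort("_id", 1).skip(skipped).limit(batch_size)
        batch = list(cursor)
        if not batch:
            break
        yield batch
        skipped += len(batch)


# Usage (hypothetical collection object from PyMongo):
#   for batch in iter_batches(db_scrapper.articles, batch_size=100):
#       process(batch)  # process is a placeholder for your own function
```

Stopping on the first empty batch avoids the separate count query, at the cost of one extra round trip at the end.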

To fetch documents using search queries, where the author is Jean Francois:

query = {'author': 'Jean Francois'}

docs = [d for d in db.collection.find(query)]

Where the author field exists and is not null:

query = {'author': {'$exists': True, '$ne': None}}

docs = [d for d in db.collection.find(query)]

Many other query operators are available, offering a great deal of flexibility and precision; we highly recommend taking the time to go through them.

You can find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.find/
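Because a query is just a Python dict, filters can be assembled programmatically before being passed to find. The sketch below combines the $exists and $ne operators shown above with $in, another standard MongoDB operator that matches any value from a list. The function name build_query and the second author name are made up for illustration.

```python
def build_query(authors, published=None):
    """Build a filter for documents by any of `authors` with non-null content."""
    query = {
        # $in matches documents whose author equals any value in the list
        "author": {"$in": list(authors)},
        # the content field must exist and must not be null
        "content": {"$exists": True, "$ne": None},
    }
    if published is not None:
        # plain equality match on the published flag
        query["published"] = published
    return query


query = build_query(["Jean Francois", "Another Author"], published=True)
print(query)

# Usage (hypothetical collection object from PyMongo):
#   docs = list(db_scrapper.articles.find(query))
```

Conditions on different fields in the same filter document are combined with a logical AND.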

Update: To update all documents where the author is Jean Francois and set the attribute published to True:

query_search = {'author': 'Jean Francois'}

query_update = {'$set': {'published': True}}

db.collection.update_many(query_search, query_update)

Or you can update just the first matching document:

db.collection.update_one(query_search, query_update)

Find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.update/
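The update methods return a result object that reports what happened, which is useful for checking that an update touched what you expected. Here is a small sketch that reads those counters; publish_author is a hypothetical helper name, while matched_count and modified_count are attributes of the result returned by PyMongo's update_many.

```python
def publish_author(collection, author):
    """Mark every document by `author` as published.

    Returns a (matched, modified) tuple: how many documents matched the
    filter and how many were actually changed.
    """
    result = collection.update_many(
        {"author": author},
        {"$set": {"published": True}},
    )
    return result.matched_count, result.modified_count


# Usage (hypothetical collection object from PyMongo):
#   matched, modified = publish_author(db_scrapper.articles, "Jean Francois")
```

The two numbers differ when some matching documents already had published set to True, since those are matched but not modified.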

Remove: Remove all documents where the author is Jean Francois:

query_search = {'author': 'Jean Francois'}

db.collection.delete_many(query_search)

Or remove the first matching document:

db.collection.delete_one(query_search)

Find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/remove-documents/
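Since calling delete_many with an empty filter removes every document in the collection, a small guard can prevent accidents. The following is a sketch of one possible guard, not a PyMongo feature; delete_matching is a hypothetical name, and deleted_count is the attribute exposed by the result of delete_many.

```python
def delete_matching(collection, query):
    """Delete documents matching `query`; return how many were removed.

    Raises ValueError on an empty filter, which would otherwise delete
    every document in the collection.
    """
    if not query:
        raise ValueError("refusing to delete with an empty filter")
    result = collection.delete_many(query)
    return result.deleted_count


# Usage (hypothetical collection object from PyMongo):
#   removed = delete_matching(db_scrapper.articles, {'author': 'Jean Francois'})
```

If you really do want to empty a collection, dropping it is usually the more direct route.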

Drop: You can drop collections by the following:

db.collection.drop()

Or you can drop the whole database through the client:

client.drop_database('scrapper')
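Because dropping is irreversible, it can be worth checking what exists before removing it. The sketch below uses two PyMongo database methods, list_collection_names and drop_collection; the wrapper name drop_collection_if_exists is hypothetical.

```python
def drop_collection_if_exists(db, name):
    """Drop collection `name` from `db` if present; return True if dropped."""
    if name in db.list_collection_names():
        db.drop_collection(name)
        return True
    return False


# Usage (hypothetical database object from PyMongo):
#   dropped = drop_collection_if_exists(client.scrapper, "articles")
```

The return value lets a script report whether anything was removed instead of silently doing nothing.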

We saw how to store and access data from MongoDB. MongoDB has gained a lot of popularity and is the preferred database choice for many, especially when it comes to working with social media data.

If you found our post to be useful, do make sure to check out Python Social Media Analytics, which contains useful tips and tricks on leveraging the power of Python for effective data analysis from various social media sites such as YouTube, GitHub, Twitter etc.

Amey Varangaonkar

Data Science Enthusiast. A massive science fiction and Manchester United fan. Loves to read, write and listen to music.