[box type=”note” align=”” class=”” width=””]The following excerpt is taken from the book Python Social Media Analytics, co-authored by Siddhartha Chatterjee and Michal Krystyanczuk.[/box]
This article explains how to perform different operations with MongoDB and Python to access and modify data effectively.
According to the official MongoDB page:
MongoDB is free and open-source, distributed database which allows ad hoc queries, indexing, and real time aggregation to access and analyze your data. It is published under the GNU Affero General Public License and stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time.
Along with ease of use, MongoDB is recognized for the following advantages:
Data within collections is linked using specific identifiers. The document model is particularly useful here, as most social media APIs provide their data in JSON format.
You can access information on the implementation of Sharding in the official documentation of MongoDB: https://docs.mongodb.com/v3.0/sharding/
MongoDB can be downloaded and installed from the following link: http://www.mongodb.org/downloads?_ga=1.253005644.410512988.1432811016.
MongoDB requires a data directory to store all the data. On Windows, the default location is \data\db on the drive you are working from, and the directory can be created as follows:
md \data\db
We need to go to the MongoDB installation folder, where mongod.exe is stored in the bin subfolder, and run the following command:
bin\mongod.exe
Once the MongoDB server is running in the background, we can switch to our Python environment to connect and start working.
MongoDB can be used directly from the command shell or through programming languages. For the sake of our book, we'll explain how it works using Python. MongoDB is accessed from Python through a driver module named PyMongo.
We will not go into the detailed usage of MongoDB, which is beyond the scope of this book. We will see the most common functionalities required for analysis projects. We highly recommend reading the official MongoDB documentation.
PyMongo can be installed using the following command:
pip install pymongo
Then the following lines import it in the Python script and create a client connected to the local server:
from pymongo import MongoClient
client = MongoClient('localhost:27017')
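As a small sketch of the connection step, the snippet below spells out the connection string and checks that the server is actually reachable before any work is done. The helper names `build_uri` and `connect` and the two-second timeout are assumptions made for illustration and are not part of the book's code; `serverSelectionTimeoutMS` and the `ping` command are standard PyMongo and MongoDB features.

```python
def build_uri(host="localhost", port=27017):
    """Return a MongoDB connection string for a single server."""
    return f"mongodb://{host}:{port}/"


def connect(uri, timeout_ms=2000):
    """Connect and fail fast if no server answers (hypothetical helper)."""
    # Imported here so this module loads even where PyMongo is not installed.
    from pymongo import MongoClient

    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    # MongoClient connects lazily; "ping" forces a round trip to the server.
    client.admin.command("ping")
    return client
```

Calling `connect(build_uri())` should return a usable client when the local server is running, and raise `pymongo.errors.ServerSelectionTimeoutError` after roughly two seconds when it is not.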
The database structure of MongoDB is similar to that of SQL databases, where you have databases, and inside databases you have tables. In MongoDB you have databases, and inside them you have collections. Collections are where you store the data, and databases store multiple collections. As MongoDB is a NoSQL database, your collections do not need to have a predefined structure; you can add documents of any composition as long as they are JSON objects. By convention, however, it is best practice to keep a common general structure for documents in the same collection.
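Since MongoDB itself will not enforce a common structure, one option is to check documents in Python before storing them. The following is a minimal sketch; the `REQUIRED_FIELDS` set and the `missing_fields` helper are hypothetical names chosen for this example, and the sample documents are made up.

```python
# Assumed common structure for documents in one collection (illustrative only).
REQUIRED_FIELDS = {"author", "content"}


def missing_fields(doc, required=REQUIRED_FIELDS):
    """Return the sorted list of required fields absent from a document."""
    return sorted(required - set(doc))


article = {"author": "Jean Francois", "content": "...", "comment": []}
partial = {"author": "Jean Francois", "likes": 3}

print(missing_fields(article))  # []
print(missing_fields(partial))  # ['content']
```

Both documents are valid JSON-like objects and MongoDB would accept either; the check only makes the convention explicit.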
To access a database named scrapper, we simply have to do the following:
db_scrapper = client.scrapper
To access a collection named articles in the database scrapper, we do this:
db_scrapper = client.scrapper
collection_articles = db_scrapper.articles
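PyMongo also supports bracket notation for databases and collections, which may be handy when the names are held in variables or are not valid Python attribute names. The sketch below wraps that lookup in a hypothetical helper, `get_collection`, which is not part of PyMongo or the book.

```python
def get_collection(client, db_name, collection_name):
    """Look up a collection by name using bracket notation."""
    # client[db_name] is equivalent to client.<db_name>, and the same holds
    # for collections, but it also works for names stored in variables.
    return client[db_name][collection_name]
```

With a connected client, `get_collection(client, "scrapper", "articles")` should return the same collection object as `client.scrapper.articles`.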
Once you have the client object initiated you can access all the databases and the collections very easily.
Now, we will see how to perform different operations:
To store documents, we first create a list of documents:
docs = []
for _ in range(0, 10):
    # each document must be of the python type dict
    docs.append({
        "author": "...",
        "content": "...",
        "comment": ["...", ... ]
    })
Inserting all the docs at once:
db.collection.insert_many(docs)
Or you can insert them one by one:
for doc in docs:
    db.collection.insert_one(doc)
You can find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/insert-documents/.
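For large lists, a middle ground between one `insert_many` call and many `insert_one` calls is to insert in chunks. This is only a sketch: the helpers `chunked` and `insert_in_chunks` and the default chunk size of 1000 are assumptions for illustration, while `insert_many` and the `inserted_ids` attribute of its result are part of PyMongo.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_in_chunks(collection, docs, size=1000):
    """Insert docs chunk by chunk and return how many were inserted."""
    inserted = 0
    for chunk in chunked(docs, size):
        result = collection.insert_many(chunk)
        # inserted_ids holds one generated _id per stored document.
        inserted += len(result.inserted_ids)
    return inserted
```

A call such as `insert_in_chunks(db.collection, docs, size=500)` would issue one request per 500 documents instead of a single very large one.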
To fetch all documents:
# as the find function returns a cursor, we will iterate over the cursor
# to actually fetch the data from the database
docs = [d for d in db.collection.find()]
To fetch all documents in batches of 100 documents:
batch_size = 100
iteration = 0
# getting the total number of documents in the collection
count = db.collection.count_documents({})
while iteration * batch_size < count:
    docs = [d for d in db.collection.find().skip(batch_size * iteration).limit(batch_size)]
    iteration += 1
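The batching loop can also be written as a generator, so that each batch is handed to the caller as it is fetched. The sketch below is one possible variant rather than the book's method: `iter_batches` is a hypothetical helper, it stops when a batch comes back empty instead of counting documents first, and it sorts on `_id` on the assumption that a stable order is wanted between pages.

```python
def iter_batches(collection, batch_size=100, query=None):
    """Yield lists of at most `batch_size` documents until none are left."""
    offset = 0
    while True:
        cursor = (collection.find(query or {})
                  .sort("_id", 1)      # stable order so pages do not overlap
                  .skip(offset)
                  .limit(batch_size))
        batch = list(cursor)           # iterating the cursor fetches the data
        if not batch:
            break
        yield batch
        offset += len(batch)
```

Usage would look like `for docs in iter_batches(db.collection, 100): ...`, with `docs` holding up to 100 documents per iteration.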
To fetch documents using search queries, where the author is Jean Francois:
query = {'author': 'Jean Francois'}
docs = [d for d in db.collection.find(query)]
Where the author field exists and is not null:
query = {'author': {'$exists': True, '$ne': None}}
docs = [d for d in db.collection.find(query)]
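Because queries are plain Python dictionaries, conditions can be combined programmatically before being passed to `find`. The following sketch builds a query that matches any of several authors and optionally requires a non-null content field; `build_author_query` is a hypothetical helper, and the second author name is made up. `$in`, `$exists` and `$ne` are standard MongoDB query operators.

```python
def build_author_query(authors, require_content=False):
    """Build a filter matching documents written by any of `authors`."""
    query = {"author": {"$in": list(authors)}}
    if require_content:
        # Same pattern as above: the field must exist and must not be null.
        query["content"] = {"$exists": True, "$ne": None}
    return query


query = build_author_query(["Jean Francois", "Marie Dupont"], require_content=True)
print(query)
```

The resulting dictionary can be used exactly like the queries above, for example `db.collection.find(query)`.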
Many other filtering methods are available, offering a great deal of flexibility and precision; we highly recommend taking the time to go through the different search operators.
You can find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.find/
To update all documents where the author is Jean Francois and set the attribute published as True:
query_search = {'author': 'Jean Francois'}
query_update = {'$set': {'published': True}}
db.collection.update_many(query_search, query_update)
Or you can update just the first matching document:
db.collection.update_one(query_search, query_update)
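Both update methods return a result object, which can be used to check what the update actually did. As a minimal sketch, the hypothetical helper `publish_author` below wraps the update and reports the counts; `matched_count` and `modified_count` are attributes of PyMongo's update result.

```python
def publish_author(collection, author):
    """Mark all documents by `author` as published and report the counts."""
    result = collection.update_many(
        {"author": author},
        {"$set": {"published": True}},
    )
    # matched_count: documents that satisfied the filter.
    # modified_count: documents whose content actually changed.
    return result.matched_count, result.modified_count
```

If some matching documents already had `published` set to True, the second number would be expected to be smaller than the first.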
Find more detailed documentation at: https://docs.mongodb.com/v3.2/reference/method/db.collection.update/
To remove all documents where the author is Jean Francois:
query_search = {'author': 'Jean Francois'}
db.collection.delete_many(query_search)
Or remove the first matching document:
db.collection.delete_one(query_search)
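Since deletions cannot be undone, it can help to look at how many documents a filter matches before removing them. The sketch below assumes a hypothetical helper, `delete_by_author`, with a `dry_run` flag that is not part of PyMongo; `count_documents`, `delete_many` and `deleted_count` are.

```python
def delete_by_author(collection, author, dry_run=True):
    """Count, or actually delete, all documents written by `author`."""
    query = {"author": author}
    if dry_run:
        # Only report how many documents would be removed.
        return collection.count_documents(query)
    return collection.delete_many(query).deleted_count
```

A cautious workflow would call it once with the default `dry_run=True`, inspect the number, and then call it again with `dry_run=False`.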
Find more detailed documentation at: https://docs.mongodb.com/v3.2/tutorial/remove-documents/
To drop a collection:
db.collection.drop()
Or you can drop the whole database:
client.drop_database('scrapper')
We saw how to store and access data from MongoDB. MongoDB has gained a lot of popularity and is the preferred database choice for many, especially when it comes to working with social media data.
If you found our post to be useful, do make sure to check out Python Social Media Analytics, which contains useful tips and tricks on leveraging the power of Python for effective data analysis from various social media sites such as YouTube, GitHub, Twitter etc.