MongoDB and MapReduce Programming
Big Data Analytics using Python
Introduction – MongoDB
• MongoDB is an open-source, cross-platform, and distributed document-based
database designed for ease of application development and scaling.
• It is a NoSQL database developed by MongoDB Inc.
• The name MongoDB is derived from the word "humongous", which means huge or
enormous.
• MongoDB is built to store huge amounts of data and to perform fast.
• MongoDB is not a Relational Database Management System (RDBMS).
• It's called a "NoSQL" database. Unlike SQL-based databases, it does not normalize
data into schemas and tables where every table has a fixed structure.
• Instead, it stores data in collections as JSON-based documents and does not
enforce schemas.
• It does not have tables, rows, and columns as other SQL (RDBMS) databases.
MongoDB and RDBMS Terminologies
RDBMS Concept MongoDB Concept
Database Database
Table Collection
Tuple/Row Document
Column Field
Table Join Embedded Documents
Primary Key Primary Key (Default key _id provided by MongoDB itself)
Example of JSON based document.
What is Document based storage?
• A Document is nothing but a data structure with name-value pairs like in
JSON.
• It is very easy to map any custom Object of any programming language with a
MongoDB Document.
• For example: a Student object has attributes name, rollno, and subjects, where
subjects is a list.
{
name : "Maria",
rollno : 1,
subjects : ["C Language", "C++", "Core Java"]
}
Key Features of MongoDB
• MongoDB provides high-performance input/output operations.
• Fewer I/O operations are needed than in relational databases, due to the support of
embedded documents (rich data models).
• Read queries are also fast, because indexes in MongoDB support efficient lookups.
Overview of MongoDB
• MongoDB consists of a set of databases.
• Each database again consists of collections.
• Data in MongoDB is stored in collections.
• The typical database structure in MongoDB: a server hosts databases, each
database holds collections, and each collection holds documents.
Advantages of MongoDB
• MongoDB stores data as JSON-based documents and does not enforce a schema.
• It allows us to store hierarchical data in a document. This makes it easy to store
and retrieve data in an efficient manner.
• It is easy to scale up or down as per the requirement, since it is a document-based
database.
• MongoDB also allows us to split data across multiple servers.
• MongoDB provides rich features like indexing, aggregation, file store, etc.
• MongoDB performs well even with huge volumes of data.
• MongoDB provides drivers to store and fetch data from different applications
developed in different technologies such as C#, Java, Python, Node.js, etc.
• MongoDB provides tools to manage MongoDB databases.
Switch or Create a new MongoDB Database
• MongoDB is a document-oriented open-source NoSQL database.
• It is one of the most popular and widely used NoSQL databases.
• A database is a place where data is stored in an organized way.
• In MongoDB, databases are used to store collections.
• A single MongoDB server can have multiple databases and a single MongoDB database
can have multiple collections.
• You can use MongoDB Shell or MongoDB Compass to create a new database.
• MongoDB provides the use <database-name> command to switch to a database.
• If the specified database does not exist, MongoDB creates it and sets it as the current
database.
• For example, the following command switches to the "humanResourceDB" database and
creates it if it does not exist.
use humanResourceDB
MongoDB will automatically switch to the newly created database.
Notice that the shell prompt changes to humanResourceDB> now.
To check all the databases, use the "show dbs" command, as shown below.
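A sketch of the expected output (database sizes are illustrative and vary by deployment):
humanResourceDB> show dbs
admin    40.00 KiB
config   60.00 KiB
local    40.00 KiB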
As you can see above, the "admin", "config", and "local" are default databases. As of now, "humanResourceDB" is not
visible. This is because there is no collection in it.
To delete a database, use the db.dropDatabase() method which deletes a current database.
MongoDB Collections
• A collection in MongoDB is similar to a table in RDBMS.
• MongoDB collections do not enforce schemas.
• Each MongoDB collection can have multiple documents.
• A document is equivalent to a row in a table in RDBMS.
• To create a collection, use the db.createCollection() command.
• The following creates a new employees collection in the current
database, which is the humanResourceDB database created
earlier.
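A sketch of the command as it might appear in the shell, while humanResourceDB is the current database:
humanResourceDB> db.createCollection("employees")
{ ok: 1 }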
Above, the employees collection is created using the
createCollection() method.
It returns an object { ok: 1 }, which indicates the
collection was created successfully.
As mentioned above, a single database can have
multiple collections.
The following creates multiple collections (see the combined sketch below).
Use the show collections command to list all the
collections in a database.
• To delete a collection, use the db.<collection-name>.drop() method.
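A combined sketch of these commands (the departments and projects collection names are purely illustrative):
humanResourceDB> db.createCollection("departments")
{ ok: 1 }
humanResourceDB> db.createCollection("projects")
{ ok: 1 }
humanResourceDB> show collections
departments
employees
projects
humanResourceDB> db.projects.drop()
true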
MongoDB Documents: Document, Array,
Embedded Document
• In the RDBMS database, a table can have multiple rows and columns.
• Similarly in MongoDB, a collection can have multiple documents
which are equivalent to the rows.
• Each document has multiple "fields" which are equivalent to the
columns.
• So in simple terms, each MongoDB document is a record and a
collection is a table that can store multiple documents.
Example of JSON based document.
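A small document of the kind described here might look like this (the values are illustrative):
{
   "_id": ObjectId("32521df3f4948bd2f54218"),
   "firstName": "John",
   "lastName": "King"
}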
In the above example, a document is contained within the curly braces. It contains multiple fields in "field":"value"
format. Above, "_id", "firstName", and "lastName" are field names with their respective values after a colon :.
Fields are separated by a comma. A single collection can store many such documents.
The following chart illustrates the relation between a database, its collections,
and their documents: a database contains collections, and each collection contains documents.
Example of a document that contains an array
and an embedded document.
{
"_id": ObjectId("32521df3f4948bd2f54218"),
"firstName": "John",
"lastName": "King",
"email": "john.king@abc.com",
"salary": "33000",
"DoB": new Date('Mar 24, 2011'),
"skills": [ "Angular", "React", "MongoDB" ],
"address": {
"street":"Upper Street",
"house":"No 1",
"city":"New York",
"country":"USA"
}
}
A MongoDB document stores data in JSON format.
In the document, "firstName", "lastName", "email", and "salary"
are the fields (like columns of a table in RDBMS) with their
corresponding values (e.g. the value of a column in a row).
Consider the "_id" field as a primary key field that stores a unique
ObjectId.
"skills" is an array and "address" holds another JSON document.
JSON vs BSON
• MongoDB stores data in key-value pairs as a BSON document.
• BSON is a binary representation of a JSON document that supports more data types than JSON.
• MongoDB drivers convert JSON documents to BSON data.
Important Points:
• MongoDB reserves the _id field name for use as a unique primary key that holds an ObjectId by default.
However, you are free to give it a value of any data type other than an array.
• A document field name cannot be null, but its value can be.
• MongoDB documents should not have duplicate field names; whether duplicates are tolerated depends on
the driver you use to store a document in your application.
• Document field names can be written without quotation marks " " if they do not contain spaces, e.g. { name:
"Steve"} and { "first name": "Steve"} are both valid.
• Use the dot notation to access array elements or embedded documents.
• MongoDB supports a maximum document size of 16 MB. Use GridFS to store documents larger than 16 MB.
• Fields in a BSON document are ordered. It means field order is important while comparing two
documents, e.g. {x: 1, y: 2} is not equal to {y: 2, x: 1}.
• MongoDB keeps the order of the fields, except the _id field, which is always the first field.
• MongoDB collection can store documents with different fields. It does not enforce any schema.
Embedded Documents
• A document in MongoDB can have fields that hold another document. These are also called nested documents.
• The following is an embedded document where the department and address fields contain other documents.
{
_id: ObjectId("32521df3f4948bd2f54218"),
firstName: "John",
lastName: "King",
department: {
_id: ObjectId("55214df3f4948bd2f8753"),
name:"Finance"
},
address: {
phone: { type: "Home", number: "111-000-000" }
}
}
• In the above embedded document, notice that the address field
contains the phone field, which holds a second-level document.
• An embedded document can contain up to 100 levels of nesting.
• The overall document still has a maximum size of 16 MB.
• Embedded documents can be accessed using dot notation:
embedded-document.fieldname,
• e.g. access the phone number using address.phone.number
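In a query, this dot notation could be used like so (a sketch; the employees collection name is assumed for illustration):
// find documents whose embedded phone number matches, using dot notation
db.employees.find({ "address.phone.number": "111-000-000" })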
Array
• A field in a document can hold an array.
• Arrays can hold any type of data, including embedded documents.
• Array elements in a document can be accessed using dot notation with the
zero-based index position, enclosed in quotes.
{
_id: ObjectId("32521df3f4948bd2f54218"),
firstName: "John",
lastName: "King",
email: "john.king@abc.com",
skills: [ "Angular", "React", "MongoDB" ],
}
The above document contains the skills field that holds an array of strings. To specify or access the second
element in the skills array, use skills.1.
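In a query, that looks like this (a sketch; the employees collection name is again assumed):
// match documents whose second skill (zero-based index 1) is "React"
db.employees.find({ "skills.1": "React" })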
MongoDB - Datatypes
Datatypes Examples
String:
Examples:
"name": "John"
"city": "New York"
Number (Integer and Double):
Integer Example:
"age": 30
Double Example:
"price": 19.99
Boolean:
Examples:
"isStudent": true
"isWorking": false
Date:
Example:
"birthDate": ISODate("1990-05-15T00:00:00Z")
ObjectId: A unique identifier for documents within a collection.
MongoDB automatically assigns an ObjectId to each document.
Example:
{
"_id": ObjectId("5f5c6d8d165bc2a3a9825ef1"),
"name": "Alice"
}
Array: Represents an ordered list of values.
Example:
{
"hobbies": ["Reading", "Swimming", "Cooking"]
}
Embedded Document: Represents a document embedded within another document.
{
"address": {
"street": "123 Main St",
"city": "Anytown",
"state": "CA"
}
}
Null: Represents the absence of a value.
{
"middleName":null
}
Regular Expression: Represents regular expressions.
{
"regexPattern":/^abc/
}
Symbol: Represents symbol data (deprecated).
{
"symbol": Symbol("sample_symbol")
}
Binary Data: Represents binary data in various formats (e.g.,
binary, UUID, MD5).
{
"imageData": BinData(0, "aGVsbG8=") // Base64-encoded
binary data
}
Decimal128: Represents high-precision decimal numbers.
{
"price": NumberDecimal("19.99")
}
Min Key and Max Key: Special values representing the smallest
and largest BSON elements.
{
"minValue": MinKey,
"maxValue": MaxKey
}
What is MongoDB MapReduce?
• Map-Reduce is a programming paradigm in MongoDB that enables
you to process large data sets and produce aggregated results.
• The map-reduce operations in MongoDB are performed by the
mapReduce() function.
• The map and reduce functions are its two main inputs.
• It is possible to group all the data based on a key value using the map
function and perform operations on this grouped data using the
reduce function.
• mapReduce() works best on large collections of data.
• With map-reduce, you can aggregate data using key-based
operations such as max and avg, similar to GROUP BY in SQL.
• As a result, each data set is mapped and reduced independently in
different spaces and then combined in a function, resulting in a new
collection.
• Again, data is processed independently and in parallel.
Consider the following map-reduce
operation:
• In very simple terms, the mapReduce command takes two primary inputs: the mapper function and the reducer
function.
• A mapper starts by reading a collection of data, builds a map containing only the required fields we wish
to process, and groups the values into one array per key.
• These key-value pairs are then fed into a reducer, which processes the values.
MongoDB MapReduce Syntax and Parameter
Syntax:
db.collection.mapReduce(
   function() { emit(key, value); },                  // define the map function
   function(key, values) { return reduceFunction; },  // define the reduce function
   {
      out: collection,
      query: document,
      sort: document,
      limit: number
   }
)
Parameter Explanation
• The above map-reduce function will query the collection, then map the matching documents to key-value
pairs via emit. After this, the pairs are reduced based on the keys that have multiple values. Here, we have used the
following functions and parameters.
• Map: – It is a JavaScript function. It is used to map a value with a key and produces a key-value pair.
• Reduce: – It is a JavaScript function. It is used to reduce or group together all the documents which have the
same key.
• Out: – It is used to specify the location of the map-reduce query output.
• Query: – It is used to specify the optional selection criteria for selecting documents.
• Sort: – It is used to specify the optional sort criteria.
• Limit: – It is used to specify the optional maximum number of documents which are desired to be returned.
• Finalize: MongoDB provides this method as an optional parameter. It follows the reduce method and can modify
its output.
• Scope: Specifies global variables that are accessible in the map, reduce, and finalize functions.
• JsMode: Specifies whether to convert intermediate data into BSON format while executing the map and reduce functions.
• Verbose: By default, verbose is set to false in mapReduce commands. It specifies whether timing information is included in the result.
• Collation: MongoDB's mapReduce method accepts a collation parameter as an optional parameter. It specifies
which collation will be used during map-reduce operations.
Emit key-value pairs
• Emitting intermediate key-value pairs can be better
understood with a simple real-time example.
• Let's consider a scenario where you have a log file containing web
access records (each log entry represents a visit to a website), and
you want to count the number of visits per web page.
You can use MapReduce to achieve this:
Map Phase:
• In the Map phase, you read each log entry, extract the web page URL visited,
and emit key-value pairs where the key is the web page URL and the value is 1
(indicating one visit).
Example Log Entries:
Log 1: User1 visited PageA
Log 2: User2 visited PageB
Log 3: User1 visited PageA
Map Function (conceptual sketch; extractPageURL is a placeholder helper, not a built-in):
function map(key, logEntry) {
// Extract the web page URL from logEntry
var pageURL = extractPageURL(logEntry);
// Emit a key-value pair for each page visit
emit(pageURL, 1);
}
Emitted Key-Value Pairs:
(PageA, 1)
(PageB, 1)
(PageA, 1)
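Reduce Phase: a matching reduce function (a sketch, using the shell's Array.sum helper) sums the emitted counts per page URL:
function reduce(pageURL, counts) {
   // sum the 1s emitted for this page to get the total number of visits
   return Array.sum(counts);
}
Reduced Output:
(PageA, 2)
(PageB, 1)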
Example 1: MapReduce Function
• In this example, we have five records from which we need to find the
maximum sale amount for each sales section; the fields are id, sales, and
amount.
• {"id": 1, "sales": "A", "amount": 80}
• {"id": 2, "sales": "A", "amount": 90}
• {"id": 1, "sales": "B", "amount": 99}
• {"id": 1, "sales": "B", "amount": 95}
• {"id": 1, "sales": "C", "amount": 90}
• Here we need to find the maximum amount for each sales value.
• So, our key by which we will group documents is the sales key and the
value will be amount.
• Inside the map function, we use emit(this.sales, this.amount)
function, and we will return the sales and amount of each
record(document) from the emit function.
• This is similar to GROUP BY in MySQL.
var map = function(){emit(this.sales, this.amount)};
• After iterating over each document, the emit function will give back the
data grouped like this:
{"A": [80, 90]}, {"B": [99, 95]}, {"C": [90]}
and up to this point, this is what the map() function does.
The data given by the emit function is grouped by the sales key. Now this data
will be the input to our reduce function.
The reduce function is where the actual aggregation of data takes place.
In our example we will pick the maximum of each sales group:
A: [80, 90] = 90 (max), B: [99, 95] = 99 (max), C: [90] = 90 (max).
var reduce = function(sales, amount){ return Math.max.apply(Math, amount); }; // take the maximum of the grouped amounts
Here in the reduce() function we have reduced the records; now we will output
them into a new collection: {out: "collectionName"}
db.collectionName.mapReduce(map,reduce,{out :"collectionName"});
In the above query we have already defined the map and reduce functions.
Then, to check the result, we look into the newly created collection with
the query db.collectionName.find(), and we get:
{"_id": "A", "value": 90}
{"_id": "B", "value": 99}
{"_id": "C", "value": 90}
Syntax
db.collectionName.mapReduce(
   map,
   reduce,
   { query: {}, out: "collectionName" }
);
Example 2: MapReduce function
• Consider the following document structure that stores book details author-wise.
• The document stores the author_name of the book's author and the status of the book.
> db.author.save({"book_title" : "MongoDB Tutorial", "author_name" :
"aparajita", "status" : "active", "publish_year": "2016" })
> db.author.save({"book_title" : "Software Testing Tutorial", "author_name" :
"aparajita", "status" : "active", "publish_year": "2015" })
> db.author.save({"book_title" : "Node.js Tutorial", "author_name" : "Kritika",
"status" : "active", "publish_year": "2016" })
> db.author.save({"book_title" : "PHP7 Tutorial", "author_name" : "aparajita",
"status" : "passive", "publish_year": "2016" })
db.author.find()
{ "_id" : ObjectId("59333022523476d644344db9"), "book_title" : "MongoDB
Tutorial", "author_name" : "aparajita", "status" : "active", "publish_year" : "2016" }
{ "_id" : ObjectId("59333031523476d644344dba"), "book_title" : "Software Testing
Tutorial", "author_name" : "aparajita", "status" : "active", "publish_year" : "2015" }
{ "_id" : ObjectId("5933303e523476d644344dbb"), "book_title" : "Node.js
Tutorial", "author_name" : "Kritika", "status" : "active", "publish_year" : "2016" }
{ "_id" : ObjectId("5933304b523476d644344dbc"), "book_title" : "PHP7 Tutorial",
"author_name" : "aparajita", "status" : "passive", "publish_year" : "2016" }
Now, use the mapReduce function
• To select all the active books,
• Group them together on the basis of author_name and
• Then count the number of books by each author by using the
following code in MongoDB.
Code:
db.author.mapReduce(
function() { emit(this.author_name,1) },
function(key, values) {return Array.sum(values)},
{ query:{status:"active"}, out:"author_total" } ).find()
Output
{ "_id" : "aparajita", "value" : 2 }
{ "_id" : "Kritika", "value" : 1 }
Framework Extensions
• In MongoDB's MapReduce framework, you can incorporate certain
features and techniques to optimize the MapReduce process.
• Combiner
• Partitioner
• Searching
• Sorting
• Compression
are not directly part of the MapReduce framework but can be
important considerations in a MapReduce job in MongoDB.
Combiner
• In the Hadoop MapReduce model, a combiner is an optional mini-
reduce operation that can be applied locally on each mapper's output
before sending it to the reducer.
• In MongoDB's MapReduce, there is no built-in concept of a combiner
like in Hadoop.
• However, you can achieve similar effects by designing your map and
reduce functions carefully.
• For example, you can perform partial aggregation in the map phase
itself to reduce the amount of data that needs to be shuffled to the
reducer.
Example with respect to Hadoop
• Imagine you have a large dataset of sales transactions and you want to calculate the total
sales for each product.
• In Hadoop MapReduce, you'd have a map phase where each mapper processes a portion
of the data and emits key-value pairs with the product as the key and the sales amount
as the value.
• For example:
• Mapper 1: ("Product A", 100)
• Mapper 2: ("Product B", 50)
• Mapper 3: ("Product A", 75)
• Now, before sending this data to the reducer, a combiner function can run locally on each
mapper to perform a mini-reduction.
• It combines the values for each key:
• Mapper 1 (Combiner): ("Product A", 100)
• Mapper 2 (Combiner): ("Product B", 50)
• Mapper 3 (Combiner): ("Product A", 75)
This local aggregation reduces the amount of data that needs to be transferred over the network to the reducer.
The final output sent to the reducer is smaller and more efficient.
Example with respect to MongoDB
• In MongoDB's MapReduce, you don't have a built-in combiner concept like in Hadoop.
• However, you can design your map and reduce functions to achieve a similar effect by performing partial
aggregation in the map phase.
• For example: Suppose you have a MongoDB collection of sales transactions and you want to calculate the total
sales for each product.
1. Map Phase (JavaScript function): Your map function extracts the product and sales amount from each
document and emits a key-value pair:
var mapFunction = function() {
emit(this.product, this.amount);
};
After mapping, you have intermediate data:
("Product A", 100)
("Product B", 50)
("Product A", 75)
2. Reduce Phase (JavaScript function): In your reduce function, you can perform aggregation for each key:
var reduceFunction = function(key, values) {
return Array.sum(values);
};
In this phase, MongoDB performs the final aggregation for each product.
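Putting the two functions together, the job could be invoked like this (a sketch; the sales collection and the output collection name are assumed for illustration):
// run map-reduce over the sales collection and write the results to total_sales_per_product
db.sales.mapReduce(
   mapFunction,
   reduceFunction,
   { out: "total_sales_per_product" }
);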
Partitioner
• In Hadoop, a partitioner determines how the output of the mappers
is distributed to the reducers.
• MongoDB's MapReduce framework automatically handles the
partitioning of data based on the emitted keys in the map phase, so
you don't need to implement a custom partitioner.
Real-time Example for Partitioning in
MongoDB (Sharded Cluster):
• Suppose you have a sharded MongoDB cluster for storing user data,
and you want to perform real-time analytics on user data, including
counting the number of users in different age groups.
• Here's how partitioning (sharding) works.
1. Data Sharding
2. Real-time Analytics with MapReduce
Data Sharding
• You have a collection called user_data with a field age representing the age of users.
• MongoDB allows you to shard the data across multiple shards based on a shard key.
• In this case, you can shard the collection using the age field as the shard key.
• For example, you might have data distributed like this:
• Shard 1: Users with ages 18-30
• Shard 2: Users with ages 31-45
• Shard 3: Users with ages 46-60
• MongoDB automatically routes queries to the appropriate shard based on the shard key.
• When you perform real-time analytics on user data, the partitioning (sharding) ensures that the
data is distributed across multiple servers, improving query performance.
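Sharding the collection on age could be set up roughly as follows (a sketch; the mydb database name is illustrative, and a ranged shard key on age is assumed):
// enable sharding for the database, then shard the collection on the age field
sh.enableSharding("mydb")
sh.shardCollection("mydb.user_data", { age: 1 })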
Real-time Analytics with MapReduce
• Now, let's say you want to perform real-time analytics to count the number of
users in each age group.
• Use MongoDB's MapReduce framework to achieve this:
• Map Function: The map function would emit key-value pairs with the age
group as the key and 1 as the value.
var mapFunction = function() {
var ageGroup;
if (this.age >= 18 && this.age <= 30) {
ageGroup = "18-30";
} else if (this.age >= 31 && this.age <= 45) {
ageGroup = "31-45";
} else {
ageGroup = "46-60";
}
emit(ageGroup, 1);
};
Reduce Function: The reduce function would sum up the values
for each age group.
var reduceFunction = function(key, values) {
return Array.sum(values);
};
Executing MapReduce: When you execute this MapReduce job,
MongoDB will distribute the map tasks across the shards based
on the shard key (age), and each shard will process its portion
of the data.
The results will be combined to give you the count of users in
each age group in real-time.
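The job itself could be executed like this (a sketch; the users_per_age_group output collection name is illustrative):
// each shard runs the map tasks over its chunk of user_data; results are merged per age group
db.user_data.mapReduce(
   mapFunction,
   reduceFunction,
   { out: "users_per_age_group" }
);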
Searching
• Searching is not directly related to MapReduce but rather to querying.
• MongoDB provides a powerful query language that allows you to
search for documents within a collection using a wide range of
criteria.
• You can use MongoDB's query language to filter and select data
before applying MapReduce to it.
Example
Scenario: Imagine you have a MongoDB database storing information about
books in a library. You want to search for all books published in a specific
year.
Step 1: Connect to MongoDB
Step 2: Select the Database and Collection: Assuming you have a database
named "library" and a collection named "books," select the database and
collection: use library
Step 3: Search for Books Published in a Specific Year:
var cursor = db.books.find({ "publishYear": 2022 })
Step 4: View the Results
MongoDB will return a cursor with all the documents (books) that match the
query condition. Iterate over the cursor to print each result:
while (cursor.hasNext()) {
   printjson(cursor.next());
}
Sorting
• Sorting in MongoDB allows you to order the results of a query based
on one or more fields in ascending or descending order.
Scenario: Suppose you have a MongoDB database with a collection
named "products" containing information about various products. You
want to retrieve a list of products sorted by their prices in descending
order.
Step 1: Connect to MongoDB
Step 2: Select the Database and Collection: use ecommerce
Step 3: Sort Documents by Price
var cursor = db.products.find().sort({ "price": -1 })
MongoDB sorts the documents based on the "price" field in descending order (-1).
Cont. Sorting
• Step 4: View the Sorted Results
MongoDB will return a cursor with the documents (products) sorted by
price in descending order.
while (cursor.hasNext()) {
printjson(cursor.next());
}
This will display a list of products, starting with the highest-priced
product and descending to the lowest-priced product.
You can also sort documents by multiple fields and in ascending order
by changing the sorting criteria in the .sort() method.
db.products.find().sort({ "category": 1, "price": -1 })
Compression
• MongoDB itself doesn't provide native data compression like some
other database systems.
• Instead, it relies on storage engines and file systems to handle
compression at the storage level.
• Here's a simplified example of how data compression can be achieved
in MongoDB using the WiredTiger storage engine, which is the default
storage engine in MongoDB.
Example
Suppose you have a MongoDB database that stores a large collection of text
documents, and you want to enable compression to reduce storage space.
Step 1: Enable WiredTiger Compression
WiredTiger, the default storage engine in MongoDB, supports data
compression.
To enable compression, you need to set the
storage.wiredTiger.engineConfig.configString option to include the
"block_compressor" setting.
This setting specifies the compression algorithm to use.
Common options include "snappy" and "zlib."
Cont. Compression
# Connect to MongoDB
mongo
# Switch to the admin database
use admin
# Enable compression with Snappy algorithm
db.runCommand({setParameter: 1, "wiredTigerEngineRuntimeConfig":
"block_compressor=snappy"})
Cont. Compression
Step 2: Insert or Update Documents
When you insert or update documents in your collection, MongoDB will use
the specified compression algorithm for storage.
# Switch to your database and collection
use your_database
db.your_collection.insert({ "text": "This is a sample document." })
Step 3: Monitor Compression
You can monitor the storage and compression ratio using the MongoDB shell
or a database monitoring tool.
The WiredTiger statistics can provide insights into the compression
effectiveness.
Cont. Compression
# Switch to your database
use your_database
# Check collection statistics
db.your_collection.stats()
This command will provide information about the collection, including storage size,
data size, and compression details.
Please note that the level of compression and its effectiveness may vary depending
on the data and the chosen compression algorithm.
In a real-world scenario, you should carefully consider your data characteristics and
compression requirements before enabling compression.
Keep in mind that MongoDB also supports transparent data encryption (TDE) for
data-at-rest security, which can work in conjunction with compression to ensure
the security and efficiency of your data storage.
