DataScienceVSEM NoSQL DataBases
NoSQL Databases
Lab Manual
Prepared by
S.Jayanth Reddy
Objective: The main objective of this lab is to become familiar with the four NoSQL
databases: Redis, MongoDB, Cassandra, and Neo4j.
2. Practice CRUD (Create, Read, Update, and Delete) operations on the four
databases.
MapReduce
6. Practice with the 'MacDonald’s' collection data for a document-oriented database.
Import the restaurants collection and apply queries to get the specified output.
Installation of Redis:
You can run Redis on Windows 10 using Windows Subsystem for Linux (a.k.a. WSL2).
WSL2 is a compatibility layer for running Linux binary executables natively on
Windows 10 and Windows Server 2019. WSL2 lets developers run a GNU/Linux
environment (that includes command-line tools, utilities, and applications) directly
on Windows.
Reboot Windows after making the change — note that you only need to do this once.
Then search for Ubuntu, or your preferred distribution of Linux, and download the
latest version.
Installing Redis is simple and straightforward. The following example works with
Ubuntu (you'll need to wait for initialization and create a login upon first use):
NOTE
The sudo command may or may not be required based on the user configuration of
your system.
$ redis-cli
127.0.0.1:6379> set user:1 "Jane"
127.0.0.1:6379> get user:1
"Jane"
NOTE
By default, Redis provides 16 databases, indexed 0 to 15. You can change that number
with the databases NUMBER setting in redis.conf.
Installation of MongoDB:
Head over here and download the current version of MongoDB. Make sure you select
A. Make sure you are logged in as a user with Admin privileges. Then navigate to
your downloads folder and double click on the .msi package you just downloaded.
E. Select “Run service as Network Service user” and make a note of the data
B. Inside the data folder you just created, create another folder called db.
cd ~
C. Here, we’re going to create a file called .bash_profile using the following
command:
touch .bash_profile
D. Open the newly created .bash_profile with vim using the following command:
vim .bash_profile
G. Paste the following code into vim, making sure you replace the 4.0 with your
F. Hit the Escape key on your keyboard to exit the insert mode. Then type
:wq!
B. Re-launch Hyper.
mongo --version
Once you’ve hit enter, you should see something like this:
This means that you have successfully installed and setup MongoDB on your local
system!
Installation of Cassandra:
This page gives you information about the Cassandra version you are going to install.
Press the ‘next’ button.
This page is about the license agreement. Mark the checkbox and press the next
button.
Step 3) Press the ‘next’ button.
The following page asks for the installation location.
Go to the Windows Start menu, search for Cassandra CQL Shell, and run it.
After running the Cassandra shell, you will see the following command line
Installation of Neo4j:
Before you install Neo4j on Windows, check System Requirements to see if your
setup is suitable.
Windows service
Neo4j can also be run as a Windows service. Install the service with bin\neo4j
windows-service install, and start it with bin\neo4j start.
The available commands for bin\neo4j are: version, help, console, start, stop,
restart, status, and windows-service.
When installing a new release of Neo4j, you must first run bin\neo4j windows-service
uninstall on any previously installed versions.
Java options
When Neo4j is installed as a service, Java options are stored in the service
configuration. Changes to these options after the service is installed will not take
effect until the service configuration is updated. For example, changing the
setting server.memory.heap.initial_size in neo4j.conf will not take effect until the
service is updated and restarted. To update the service, run bin\neo4j
windows-service update. Then restart the service to run it with the new configuration.
The same applies to the path to where Java is installed on the system. If the path
changes, for example when upgrading to a new version of Java, it is necessary to run
the windows-service update command and restart the service. Then the new Java
location will be used by the service.
Example 1. Update service example
1. Install service
bin\neo4j windows-service install
2. Change memory configuration
echo server.memory.heap.initial_size=8g >> conf\neo4j.conf
echo server.memory.heap.max_size=16g >> conf\neo4j.conf
3. Update service
bin\neo4j windows-service update
4. Restart service
bin\neo4j restart
System requirements
The module file is located in the bin directory of your Neo4j installation, i.e. where
you unzipped the downloaded file. For example, if Neo4j was installed
in C:\Neo4j then the module would be imported like this:
Import-Module C:\Neo4j\bin\Neo4j-Management.psd1
This will add the module to the current session.
Once the module has been imported you can start an interactive console version of a
Neo4j Server like this:
Invoke-Neo4j console
To stop the server, issue Ctrl-C in the console window that was created by the
command.
Once the module is imported you can query the available commands like this:
Get-Command -Module Neo4j-Management
The output should be similar to the following:
CommandType Name Version Source
----------- ---- ------- ------
Function Invoke-Neo4j 5.11.0 Neo4j-Management
Function Invoke-Neo4jAdmin 5.11.0 Neo4j-Management
Function Invoke-Neo4jBackup 5.11.0 Neo4j-Management
Function Invoke-Neo4jImport 5.11.0 Neo4j-Management
Function Invoke-Neo4jShell 5.11.0 Neo4j-Management
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
@Data: Lombok annotation that provides the required constructor and getter/setter
methods.
@RedisHash: The annotation marks objects as aggregate roots to be stored in
a Redis hash.
@Id: Indicates that this is the Id field of the entity class.
@Indexed: Creates an index on Redis for the annotated field, which helps improve
performance during retrieval of data.
Similar to the JPA repositories, Spring Boot provides built-in support for basic data
operations for Redis as well.
package com.asbnotebook.repository;
import org.springframework.data.repository.CrudRepository;
import com.asbnotebook.entity.Student;
public interface StudentRepository extends CrudRepository<Student, String> {
}
This class will have all the CRUD endpoints required for our application.
Also, notice that we have autowired the repository instance into the controller class
and used its available methods to perform the CRUD operations on our Student object.
package com.asbnotebook.controller;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.DeleteMapping;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import com.asbnotebook.entity.Student;
import com.asbnotebook.repository.StudentRepository;
@RestController
public class StudentController {

    @Autowired
    private StudentRepository studentRepository;

    // Create: store a new student as a Redis hash.
    @PostMapping("/students")
    public ResponseEntity<Student> createStudent(@RequestBody Student student) {
        Student savedStudent = studentRepository.save(student);
        return new ResponseEntity<>(savedStudent, HttpStatus.CREATED);
    }

    // Update: overwrite the grade and name of an existing student.
    @PutMapping("/student/{id}")
    public ResponseEntity<Student> updateStudent(@PathVariable(name = "id") String id,
            @RequestBody Student student) {
        Optional<Student> std = studentRepository.findById(id);
        if (std.isPresent()) {
            Student studentDB = std.get();
            studentDB.setGrade(student.getGrade());
            studentDB.setName(student.getName());
            Student updatedStudent = studentRepository.save(studentDB);
            return new ResponseEntity<>(updatedStudent, HttpStatus.OK);
        }
        // No student with this id: answer 404 instead of returning null.
        return new ResponseEntity<>(HttpStatus.NOT_FOUND);
    }

    // Read: return every student stored in Redis.
    @GetMapping("/students")
    public ResponseEntity<List<Student>> getStudents() {
        List<Student> students = new ArrayList<>();
        studentRepository.findAll().forEach(students::add);
        return new ResponseEntity<>(students, HttpStatus.OK);
    }

    // Delete: remove the student with the given id.
    @DeleteMapping("/student/{id}")
    public ResponseEntity<String> deleteStudent(@PathVariable(name = "id") String id) {
        studentRepository.deleteById(id);
        return new ResponseEntity<>("Student with id:" + id + " deleted successfully", HttpStatus.OK);
    }
}
Configuring Redis
Add the Redis configuration below to the Spring Boot application's
application.properties configuration file under
the /src/main/resources/ directory.
spring.redis.host=localhost
spring.redis.port=6379
Create
Pass the student details to the POST endpoint as a JSON request.
We can then see the student stored as a Redis hash in the Redis server.
Update
Send the JSON request object to the PUT method with an updated student name.
Get
We can also get all the available students in the Redis server by calling the GET
endpoint.
Delete
Pass the id to the DELETE endpoint to delete the student object, stored in the Redis
server.
MongoDB can be used to build applications (including web and mobile), to analyze
data, or to administer a database. In all these cases we need to interact with the
MongoDB server to insert new data, update existing data, delete data, and read
data. MongoDB provides a set of basic but essential operations for interacting with
the server, and these operations are known as CRUD operations.
Create Operations –
The create or insert operations are used to insert or add new documents to a
collection. If the collection does not exist, the operation creates it in the
database. You can perform create operations using the following methods provided
by MongoDB:
Method Description
Example 1: In this example, we are inserting details of a single student in the form
of document in the student collection using db.collection.insertOne()
method.
Read Operations –
The Read operations are used to retrieve documents from the collection, or in other
words, read operations are used to query a collection for a document. You can
perform read operation using the following method provided by the MongoDB:
Method Description
.pretty() : this method is used to decorate the result such that it is easy
to read.
Example : In this example, we are retrieving the details of students from the
student collection using db.collection.find() method.
Update Operations –
The update operations are used to update or modify the existing document in the
collection. You can perform update operations using the following methods
provided by the MongoDB:
Method Description
Example 1: In this example, we are updating the age of Sumit in the student
collection using db.collection.updateOne() method.
Example 2: In this example, we are updating the year of course in all the
documents in the student collection using db.collection.updateMany() method.
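The difference between the two update methods can be modelled in plain Python. The sketch below is a toy in-memory model, not MongoDB's implementation or driver API; the helper names and the sample documents (apart from the name Sumit) are hypothetical.

```python
# Toy in-memory model of updateOne() vs updateMany() semantics.
# Helper names and sample documents are illustrative assumptions.

def matches(doc, query):
    """Return True when every field in the query equals the document's value."""
    return all(doc.get(field) == value for field, value in query.items())

def update_one(collection, query, new_values):
    """Modify only the first matching document; return how many were modified."""
    for doc in collection:
        if matches(doc, query):
            doc.update(new_values)
            return 1
    return 0

def update_many(collection, query, new_values):
    """Modify every matching document; return how many were modified."""
    modified = 0
    for doc in collection:
        if matches(doc, query):
            doc.update(new_values)
            modified += 1
    return modified

students = [
    {"name": "Sumit", "age": 20, "year": 1},
    {"name": "Asha", "age": 21, "year": 1},
    {"name": "Ravi", "age": 22, "year": 1},
]

one = update_one(students, {"name": "Sumit"}, {"age": 25})
many = update_many(students, {}, {"year": 2})  # an empty query matches every document
print(one, many)
```

An empty query matches all documents, which is why the second call touches the whole collection.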
Delete Operations –
The delete operations are used to delete or remove documents from a collection.
You can perform delete operations using the following methods provided by
MongoDB:
Method Description
Example 2: In this example, we are deleting all the documents from the student
collection using db.collection.deleteMany() method.
A user can insert data into a table using the Cassandra create operation. The data is
stored in the columns of a row in the table. A user can perform this operation using
the INSERT command with the proper syntax.
Create Operation-
INSERT INTO <tablename>
(<column1>,<column2>....)
VALUES (<value1>,<value2>...)
USING <option>
Let’s create a table to illustrate the operation. The example consists of a table with
information about students in a college. The following table gives the details of
the students.
001  Ayush  Electrical Engineering  9999999999  Boston
002  Aarav  Computer Engineering    8888888888  New York City

001  Ayush  Electrical Engineering  9999999999  Boston
002  Aarav  Computer Engineering    8888888888  New York City
Update Operation
The second Cassandra CRUD operation is the UPDATE operation. A user can use the
UPDATE command for this operation. It uses three keywords while updating the table.
Where: This keyword specifies the location where data is to be updated.
Set: This keyword specifies the updated value.
Must: This keyword includes the columns composing the primary key.
Furthermore, if the row being updated does not exist, Cassandra creates a new row
for it.
UPDATE <tablename>
SET <column name>=<value>...
WHERE <condition>
EXAMPLE 2: Let’s change few details in the table ‘student’. In this example, we will
update Aarav’s city from ‘New York City’ to ‘San Fransisco’.
INPUT:
cqlsh:keyspace1> UPDATE student SET city='San Fransisco'
WHERE en=002;
001  Ayush  Electrical Engineering  9999999999  Boston
002  Aarav  Computer Engineering    8888888888  San Fransisco
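The behaviour of UPDATE on a missing row (often called an upsert) can be sketched with a dictionary keyed by the primary key. This is only a toy model under the assumption that en is the primary key; it does not reflect how Cassandra stores data, and the third row is hypothetical.

```python
# Toy model of a Cassandra table keyed by its primary key (column "en").
# It only illustrates upsert behaviour; row 3 is a hypothetical example.

table = {
    1: {"name": "Ayush", "city": "Boston"},
    2: {"name": "Aarav", "city": "New York City"},
}

def cql_update(rows, primary_key, assignments):
    """Apply SET assignments to the row with this key, creating it if absent."""
    row = rows.setdefault(primary_key, {})
    row.update(assignments)
    return row

cql_update(table, 2, {"city": "San Fransisco"})  # existing row is modified
cql_update(table, 3, {"city": "Philadelphia"})   # missing row is created

print(table[2]["city"], sorted(table))
```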
Read Operation
The third Cassandra CRUD operation is the Read operation. A user can read either
the whole table or a single column. To read data from a table, a user can use the
SELECT command. This command is also used for verifying the table after every
operation.
INPUT:
cqlsh:keyspace1> SELECT * FROM student;
Table.4 Cassandra Crud Operation – OUTPUT After Verification
001  Ayush  Electrical Engineering  9999999999  Boston
002  Aarav  Computer Engineering    8888888888  San Fransisco
NAME CITY
Ayush Boston
Kabir Philadelphia
Delete Operation
The Delete operation is the last Cassandra CRUD operation; it allows a user to delete
data from a table. The user can use the DELETE command for this operation.
Syntax of Delete Operation-
001  Ayush  Electrical Engineering  9999999999  Boston
002  Aarav  Computer Engineering    8888888888  San Fransisco
EXAMPLE 6: In the ‘student’ table, let us delete the entire third row.
cqlsh:keyspace1> DELETE FROM student WHERE en=003;
001  Ayush  Electrical Engineering  9999999999  Boston
002  Aarav  Computer Engineering    8888888888  San Fransisco
SAVE: To store data in Neo4j, use the CREATE statement, which creates a node with
the specified label and sets the given properties on it. For example, to store movie
data, create a node with the label Movie (or any label you prefer) and pass the
property values as query parameters.
session
    .run("CREATE (movie:Movie {name: $name, releaseDate: $releaseDate}) RETURN movie",
        { name: "MY MOVIE", releaseDate: "22-04-18" })
    .then(function (result) {
        console.log(result);
    })
    .catch(function (error) {
        console.log(error);
    });
GET: To get data from Neo4j using the Bolt driver, use the MATCH statement, which
takes the label to search for and returns the matching nodes with their
properties.
session
    .run("MATCH (movie:Movie) RETURN movie")
    .then(function (result) {
        console.log(result);
    })
    .catch(function (error) {
        console.log(error);
    });
Note if we don't mention the label then it will return all the nodes of the database.
UPDATE: To update a node in the database, use the MATCH statement to find it
and the SET clause to set the respective properties on the node.
session
    .run("MATCH (movie:Movie) WHERE id(movie) = $id SET movie.name = $name RETURN movie",
        { id: 121, name: "MOVIE-2" })
    .then(function (result) {
        console.log(result);
    })
    .catch(function (error) {
        console.log(error);
    });
Note: The searching of the nodes will be faster when we mention the label of the
node.
DELETE: To delete a node from the database, use the DELETE statement. If the
node is connected to any other node by a relationship, it must be detached from
that relationship before it is deleted; DETACH DELETE does both in one step.
session
    .run("MATCH (movie:Movie) WHERE id(movie) = $id DETACH DELETE movie", { id: 121 })
    .then(function (result) {
        console.log(result);
    })
    .catch(function (error) {
        console.log(error);
    });
The MongoDB $where operator is used to match documents that satisfy a JavaScript
expression. A string containing a JavaScript expression, or a JavaScript function, can
be passed using the $where operator. Within the expression or function, the current
document may be referred to as this or obj.
Our database name is 'myinfo' and our collection name is 'table3'. The collection
is shown below.
If we want to select all documents from the collection "table3" which satisfying the
condition -
N.B. The find() method displays the documents in an unstructured format; to display
the results in a formatted way, the pretty() method can be used.
Output:
{
"_id" : ObjectId("52873b364038253faa4bbc0e"),
"student_id" : "STU002",
"sem" : "sem1",
"english" : "A",
"maths" : "A+",
"science" : "A"
}
{
"_id" : ObjectId("52873b7e4038253faa4bbc10"),
"student_id" : "STU003",
"sem" : "sem1",
"english" : "A+",
"maths" : "A",
"science" : "A+"
}
If we want to get the above output the other mongodb statements can be written as
below -
MongoDB provides different types of logical query operators, and the $and operator
is one of them. This operator performs a logical AND operation on an array of
one or more expressions and selects only those documents that match all
the given expressions in the array. You can use this operator in methods like find(),
update(), etc. according to your requirements.
This operator performs short-circuit evaluation.
If the first expression of the $and operator evaluates to false, then MongoDB will
not evaluate the remaining expressions in the array.
You can also use the AND operation implicitly with the help of a comma (,).
You can use the AND operation explicitly (i.e., $and) when the same field or
operator is specified in multiple expressions.
Syntax:
{ $and: [ { Expression1 }, { Expression2 }, ..., { ExpressionN } ] }
or
{ Expression1, Expression2, ..., ExpressionN }
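The explicit and implicit forms can be compared with a small evaluator. This is a toy sketch that supports equality conditions only; it is not MongoDB's query engine, and the field names and documents are hypothetical.

```python
# Toy evaluator for equality-only $and queries; sample data is hypothetical.

def matches_expression(doc, expression):
    """An expression such as {"branch": "CSE"} matches on field equality."""
    return all(doc.get(field) == value for field, value in expression.items())

def matches_and(doc, expressions):
    """all() stops at the first expression that fails (short-circuit)."""
    return all(matches_expression(doc, expr) for expr in expressions)

docs = [
    {"name": "A", "branch": "CSE", "year": 2},
    {"name": "B", "branch": "CSE", "year": 3},
    {"name": "C", "branch": "ECE", "year": 2},
]

# Explicit form: one expression per array element.
explicit = [d["name"] for d in docs
            if matches_and(d, [{"branch": "CSE"}, {"year": 2}])]
# Implicit form: both conditions in a single query document.
implicit = [d["name"] for d in docs
            if matches_expression(d, {"branch": "CSE", "year": 2})]
print(explicit, implicit)
```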
MongoDB provides different types of logical query operators, and the $or operator
is one of them. This operator performs a logical OR operation on an array of
two or more expressions and selects only those documents that match at
least one of the given expressions in the array.
You can use this operator in methods like find(), update(), etc. according to
your requirements.
You can also use this operator with text queries, GeoSpatial queries, and sort
operations.
When MongoDB evaluates the clauses in an $or expression, it performs a
collection scan unless all the clauses are supported by indexes, in which case
it performs index scans.
You can also nest $or operations.
Syntax:
{ $or: [ { Expression1 }, { Expression2 }, ..., { ExpressionN } ] }
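The selection rule for $or can be sketched the same way as a toy evaluator. It handles equality conditions only, and the documents and field names are hypothetical rather than taken from the lab data.

```python
# Toy evaluator for equality-only $or queries; sample data is hypothetical.

def matches_expression(doc, expression):
    """An expression such as {"year": 3} matches on field equality."""
    return all(doc.get(field) == value for field, value in expression.items())

def matches_or(doc, expressions):
    """A document is selected when at least one expression matches."""
    return any(matches_expression(doc, expr) for expr in expressions)

docs = [
    {"name": "A", "branch": "CSE", "year": 2},
    {"name": "B", "branch": "CSE", "year": 3},
    {"name": "C", "branch": "ECE", "year": 2},
]

selected = [d["name"] for d in docs
            if matches_or(d, [{"branch": "ECE"}, {"year": 3}])]
print(selected)
```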
In the following examples, we are working with:
The sort() method specifies the order in which the query returns the matching
documents from the given collection. You must apply this method to the cursor
before retrieving any documents from the database. It takes a document as a
parameter that contains a field: value pair that defines the sort order of the result
set. The value is 1 or -1 specifying an ascending or descending sort respectively.
If a sort returns the same result every time we perform on same data, then
such type of sort is known as a stable sort.
If a sort returns a different result every time we perform on same data, then
such type of sort is known as unstable sort.
MongoDB generally performs a stable sort unless sorting on a field that
holds duplicate values.
We can use limit() method with sort() method, it will return first m
documents, where m is the given limit.
MongoDB can find the result of the sort operation using indexes.
If MongoDB does not find sort order using index scanning, then it uses top-k
sort algorithm.
Syntax:
db.Collection_Name.find().sort({field_name:1 or -1})
Parameter:
The parameter contains a field: value pair that defines the sort order of the result
set. The value is 1 or -1 that specifies an ascending or descending sort respectively.
The type of parameter is a document.
Return:
It returns the documents in sorted order.
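The effect of the sort direction, and of combining sort() with limit(), can be sketched on a list of dictionaries. The function name and the sample students are hypothetical, and Python's sorted() stands in for the server-side sort.

```python
# Toy model of find().sort({field: 1 or -1}).limit(m) on a list of dicts.
# The helper name and sample documents are illustrative assumptions.

def sort_documents(docs, field, direction=1, limit=None):
    """direction is 1 for ascending or -1 for descending, as in sort()."""
    ordered = sorted(docs, key=lambda d: d[field], reverse=(direction == -1))
    return ordered if limit is None else ordered[:limit]

students = [
    {"name": "Asha", "age": 21},
    {"name": "Ravi", "age": 19},
    {"name": "Meena", "age": 23},
]

ascending = [d["name"] for d in sort_documents(students, "age", 1)]
top_two = [d["name"] for d in sort_documents(students, "age", -1, limit=2)]
print(ascending, top_two)
```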
Examples:
In the following examples, we are working with:
Syntax:
cursor.limit()
Or
db.collectionName.find(<query>).limit(<number>)
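As a rough illustration, limit() behaves like taking a slice of the result list, and the related cursor method skip() moves the start of that slice. The helper below is a hypothetical stand-in operating on made-up documents, not a driver call.

```python
# Toy model of cursor.skip(n).limit(m) using list slicing.
# The helper and the twelve sample documents are hypothetical.

def skip_limit(docs, skip=0, limit=None):
    """Drop the first `skip` documents, then keep at most `limit` of the rest."""
    end = None if limit is None else skip + limit
    return docs[skip:end]

results = [{"restaurant_id": str(n)} for n in range(1, 13)]

first_five = [d["restaurant_id"] for d in skip_limit(results, limit=5)]
next_five = [d["restaurant_id"] for d in skip_limit(results, skip=5, limit=5)]
print(first_five, next_five)
```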
Examples:
In the following examples, we are working with:
Indexing in MongoDB
MongoDB uses indexing to make query processing more efficient. Without an index,
MongoDB must scan every document in the collection and retrieve only those
documents that match the query. Indexes are special data structures that store some
information about the documents so that MongoDB can easily find the right data.
The index entries are ordered by the value of the field specified in the index.
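The benefit of ordered index entries can be sketched with a sorted list and binary search. This is a simplified stand-in for illustration (MongoDB's own index structures differ), and the documents and function names are hypothetical.

```python
import bisect

# Toy single-field index: a sorted list of (value, position) pairs.
# This is a simplification for illustration; sample data is hypothetical.

documents = [
    {"name": "Asha", "age": 21},
    {"name": "Ravi", "age": 19},
    {"name": "Meena", "age": 23},
    {"name": "Kiran", "age": 21},
]

def collection_scan(docs, field, value):
    """Without an index every document must be examined."""
    return [d["name"] for d in docs if d[field] == value]

def build_index(docs, field):
    """Index entries are kept ordered by the value of the indexed field."""
    return sorted((d[field], pos) for pos, d in enumerate(docs))

def index_lookup(docs, index, value):
    """Binary search finds the first entry; matching entries follow in order."""
    start = bisect.bisect_left(index, (value, -1))
    names = []
    for key, pos in index[start:]:
        if key != value:
            break
        names.append(docs[pos]["name"])
    return names

age_index = build_index(documents, "age")
scanned = collection_scan(documents, "age", 21)
looked_up = index_lookup(documents, age_index, 21)
print(scanned, looked_up)
```

Both approaches return the same documents; the indexed lookup simply avoids visiting entries outside the matching range.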
Creating an Index :
MongoDB provides a method called createIndex() that allows user to create an
index.
Syntax –
db.COLLECTION_NAME.createIndex({KEY:1})
The key determines the field on the basis of which you want to create an index and 1
(or -1) determines the order in which these indexes will be arranged(ascending or
descending).
Example –
db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
The createIndex() method also has a number of optional parameters.
These include:
background (Boolean)
unique (Boolean)
name (string)
sparse (Boolean)
expireAfterSeconds (integer)
hidden (Boolean)
storageEngine (Document)
Drop an index:
In order to drop an index, MongoDB provides the dropIndex() method.
Syntax –
db.NAME_OF_COLLECTION.dropIndex({KEY:1})
The dropIndex() method can delete only one index at a time. To drop multiple
indexes from a collection, MongoDB provides the dropIndexes() method, which takes
multiple indexes as its parameters.
Syntax –
db.NAME_OF_COLLECTION.getIndexes()
It retrieves the descriptions of all the indexes created within the collection.
db.students.find().sort({"studentAge":1,"studentName":1}).pretty()
Here the sort is based on “studentAge” followed by “studentName”. Two documents
match “studentAge = 25”, so studentName is used to break the tie: the document
with studentName “Geek40” is displayed second, and only after that the document
with studentName “GeeksForGeeksbest” is displayed third. Compound indexes are
useful when queries need this finer level of ordering or filtering.
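The tie-breaking order described here can be reproduced with a tuple sort key. The two documents aged 25 use the names from the text; the third document is a hypothetical addition, and Python's sorted() stands in for the server-side sort.

```python
# Toy model of sort({"studentAge": 1, "studentName": 1}).
# "Geek10" is a hypothetical third document added for illustration.

students = [
    {"studentName": "GeeksForGeeksbest", "studentAge": 25},
    {"studentName": "Geek10", "studentAge": 20},
    {"studentName": "Geek40", "studentAge": 25},
]

# Sort by age first; names only decide the order among equal ages.
ordered = sorted(students, key=lambda d: (d["studentAge"], d["studentName"]))
names = [d["studentName"] for d in ordered]
print(names)
```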
3. Multikey Index: MongoDB uses multikey indexes to index the values
stored in arrays. When we index a field that holds an array value, MongoDB
automatically creates a separate index entry for each value present in that array.
Using these multikey indexes we can easily find a document that contains an array
by matching its items. In MongoDB, you don’t need to explicitly specify the
multikey index, because MongoDB automatically creates one when the indexed
field contains an array value.
Syntax:
db.<collection>.createIndex( { <field>: <type>} )
Here, the value of the field is 1(for ascending order) or -1(for descending order).
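The idea of one index entry per array element can be sketched with a dictionary that maps each element to the documents containing it. This is a toy model, not MongoDB's index format, and the three sample documents are hypothetical.

```python
from collections import defaultdict

# Toy multikey index: every array element gets its own entry.
# The documents and the "language" field are illustrative assumptions.

students = [
    {"_id": 1, "name": "Asha", "language": ["C", "Python"]},
    {"_id": 2, "name": "Ravi", "language": ["Java", "Python"]},
    {"_id": 3, "name": "Meena", "language": ["C"]},
]

def build_multikey_index(docs, field):
    """Map each array element to the _id values of documents that contain it."""
    index = defaultdict(list)
    for doc in docs:
        for element in doc[field]:
            index[element].append(doc["_id"])
    return index

language_index = build_multikey_index(students, "language")
print(language_index["Python"], language_index["C"])
```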
Example:
In the students collection, we have three documents that contain array fields.
5. Text Index: MongoDB supports query operations that perform a text search of
string content. Text index allows us to find the string content in the specified
collection. It can include any field that contains string content or an array of string
items. A collection can contain at most one text index. You are allowed to use text
index in the compound index.
Syntax:
db.<collection>.createIndex( { <field>: "text" } )
We can also search for exact phrases by enclosing the search terms in escaped
double quotes:
db.<collectionname>.find( { $text: { $search: "\"<Exact search term>\"" } } )
Because the term is enclosed in double quotes, the search results contain only
documents with the exact phrase.
To exclude certain terms from the search, prefix them with a hyphen:
db.<collectionname>.find( { $text: { $search: "<search terms> -<not required search terms>" } } )
Prepending a - character to a term excludes documents that contain it; the rest of
the terms are searched as usual.
In a text search, the results are returned in unsorted order. To return them sorted
by relevance score, project the $meta textScore field and sort on it.
Example:
db.singers.find(
{ $text: { $search: "Annisten" } },
{ score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } )
Example:
In accessories collection we create text index:
db.accessories.createIndex({name: "text", description: "text"})
6. Hash Index: A hashed index maintains entries with hashes of the values of the
indexed field (most often the _id field). This kind of index is mainly used for the
even distribution of data via sharding: hashed keys help partition the data across
the sharded cluster.
Syntax:
db.<collection>.createIndex( { _id: "hashed" } )
From version 4.4 onwards, compound hashed indexes are supported.
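Why hashing spreads keys evenly can be sketched by hashing each key and mapping the hash to a shard. The sketch uses hashlib.md5 and a modulo as stand-ins; MongoDB's actual hash function and its assignment of hash ranges to shards are different, and the shard count of 4 is an arbitrary assumption.

```python
import hashlib

# Toy illustration of hash-based partitioning across shards.
# md5 plus modulo is a stand-in, not MongoDB's real hashing or chunk logic.

def shard_for(key, shard_count):
    """Hash the key and map the hash onto one of shard_count shards."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

counts = [0, 0, 0, 0]
for key in range(1000):          # sequential keys, as with increasing ids
    counts[shard_for(key, 4)] += 1

print(counts)
```

Even though the keys are sequential, their hashes are scattered, so no single shard receives the whole range.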
7. Wildcard Index: MongoDB supports creating an index either on a single field or on
a set of fields; when the set of fields is specified with the $** wildcard pattern, the
index is called a Wildcard Index. Generally, the wildcard index does not include the
_id field, but if you want to include the _id field in the wildcard index then you have
to define it explicitly. MongoDB allows you to create multiple wildcard indexes in
a given collection. Wildcard indexes support queries on unknown or arbitrary fields.
Syntax:
To create a wildcard index on the specified field:
db.<collection>.createIndex( { "field.$**":1 } )
To create a wildcard index on all the fields:
db.<collection>.createIndex( { "$**":1 } )
To create a wildcard index on multiple specified fields:
db.<collection>.createIndex(
{ "$**":1 },
{"wildcardProjection":
{"field1": 1, "field2": 1}
})
Example:
In book collection we create the wildcard index:
Aggregation in MongoDB
In MongoDB, aggregation operations process the data records/documents and
return computed results. It collects values from various documents and groups
them together and then performs different types of operations on that grouped data
like sum, average, minimum, maximum, etc to return a computed result. It is
similar to the aggregate function of SQL.
MongoDB provides three ways to perform aggregation
Aggregation pipeline
Map-reduce function
Single-purpose aggregation
Aggregation pipeline
In MongoDB, the aggregation pipeline consists of stages, and each stage transforms
the documents. In other words, the aggregation pipeline is a multi-stage pipeline:
each stage takes documents as input and produces a resultant set of documents,
which the next stage (if available) takes as its input, and this process continues
until the last stage. The basic pipeline stages provide filters that operate like
queries and document transformations that modify the resultant documents, and
other stages provide tools for grouping and sorting documents. You can also use
the aggregation pipeline on sharded collections.
Let us discuss the aggregation pipeline with the help of an example:
In this example, the collection holds train fares. In the first stage, the
$match stage filters the documents by the value of the class field, i.e. class:
“first-class”, and passes the matching documents to the second stage. In the second
stage, the $group stage groups the documents by the id field to calculate the sum of
fares for each unique id.
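The two stages can be modelled in plain Python to show how documents flow from $match into $group. This is a toy sketch, not the aggregate() API; the fare documents and their values are hypothetical.

```python
from collections import defaultdict

# Toy model of a two-stage pipeline: $match on class, then $group with a sum.
# The fare documents below are made-up sample data.

fares = [
    {"id": "181", "class": "first-class", "fare": 1200},
    {"id": "181", "class": "first-class", "fare": 400},
    {"id": "181", "class": "second-class", "fare": 1000},
    {"id": "167", "class": "first-class", "fare": 1200},
    {"id": "167", "class": "second-class", "fare": 1500},
]

def match_stage(docs, field, value):
    """Stage 1: keep only documents whose field equals the given value."""
    return [d for d in docs if d[field] == value]

def group_sum_stage(docs, key_field, sum_field):
    """Stage 2: group by key_field and add up sum_field within each group."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key_field]] += d[sum_field]
    return [{"_id": key, "total": total} for key, total in totals.items()]

first_class = match_stage(fares, "class", "first-class")
result = group_sum_stage(first_class, "id", "fare")
print(result)
```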
Here, the aggregate() function is used to perform the aggregation. It works with
three kinds of operators: stages, expressions, and accumulators.
In this example, for taking a count of the number of students in section B we first
filter the documents using the $match operator, and then we use
the $count accumulator to count the total number of documents that are passed
after filtering from the $match.
Map Reduce
Map-reduce is used for aggregating results over a large volume of data. It has two
main functions: map, which groups all the documents, and reduce, which performs
an operation on the grouped data.
Syntax:
db.collectionName.mapReduce(mappingFunction, reduceFunction,
{out:'Result'});
Example:
In the following example, we are working with:
it performs operations over it. After performing reduction the results are stored in a
collection here in this case the collection is Results.
Example
The word count program is like the "Hello World" program in MapReduce.
A MapReduce job usually splits the input data-set into independent chunks which
are processed by the map tasks in a completely parallel manner. The framework sorts
the outputs of the maps, which are then input to the reduce tasks. Typically both the
input and the output of the job are stored in a file-system. The framework takes care
of scheduling tasks, monitoring them and re-executes the failed tasks.
The WordCount example reads text files and counts how often words occur. The input
is text files and the output is text files, each line of which contains a word and the
count of how often it occurred, separated by a tab.
Each mapper takes a line as input and breaks it into words. It then emits a key/value
pair of the word and 1. Each reducer sums the counts for each word and emits a single
key/value pair with the word and sum.
As an optimization, the reducer is also used as a combiner on the map outputs. This
reduces the amount of data sent across the network by combining each word into a
single record.
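The role of the combiner can be sketched in plain Python without Hadoop. In this toy model each inner list plays the part of one map task's input split, and the same summing function is used as both combiner and reducer; the input lines and function names are hypothetical.

```python
from collections import defaultdict

# Toy word count with a combiner; no Hadoop involved.
# Input lines and helper names are illustrative assumptions.

def map_line(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_counts(pairs):
    """Reducer (also usable as combiner): sum the counts per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

splits = [
    ["hello world", "hello hadoop"],  # input of one map task
    ["goodbye hadoop"],               # input of another map task
]

emitted_total = 0
combined = []
for split in splits:
    emitted = [pair for line in split for pair in map_line(line)]
    emitted_total += len(emitted)
    combined.extend(reduce_counts(emitted))  # combiner runs on each map output

result = reduce_counts(combined)
print(emitted_total, len(combined), result)
```

The mappers emit six pairs, but only five records leave the map side, because the two occurrences of "hello" in the first split are merged before the shuffle.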
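package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input directory, args[1] the output directory.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}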
All of the files in the input directory (called in-dir in the command line above) are
read and the counts of words in the input are written to the output directory (called
out-dir above). It is assumed that both inputs and outputs are stored in HDFS. If your
input is not already in HDFS, but is rather in a local file system somewhere, you need
to copy the data into HDFS using a command like this:
bin/hadoop dfs -mkdir <hdfs-dir> //not required in hadoop 0.17.2 and later
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
mapper.py
import sys

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit a tab-separated (word, 1) pair for every word
    for word in words:
        print('%s\t%s' % (word, 1))
reducer.py
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # the input is sorted by word, so equal words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# output the last word, if there was any input
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
The above program can be run using cat filename.txt | python mapper.py
| sort -k1,1 | python reducer.py
{
"address": {
"building": "1007",
"coord": [ -73.856077, 40.848447 ],
"street": "Morris Park Ave",
"zipcode": "10462"
},
"borough": "Bronx",
"cuisine": "Bakery",
"grades": [
{ "date": { "$date": 1393804800000 }, "grade": "A", "score": 2 },
{ "date": { "$date": 1378857600000 }, "grade": "A", "score": 6 },
{ "date": { "$date": 1358985600000 }, "grade": "A", "score": 10 },
{ "date": { "$date": 1322006400000 }, "grade": "A", "score": 9 },
{ "date": { "$date": 1299715200000 }, "grade": "B", "score": 14 }
],
"name": "Morris Park Bake Shop",
"restaurant_id": "30075445"
}
1. Write a MongoDB query to display all the documents in the collection
restaurants.
Query:
db.restaurants.find();
Query:
Query:
db.restaurants.find({},{"restaurant_id" : 1,"name":1,"borough":1,"cuisine"
:1,"_id":0});
Query:
db.restaurants.find({"borough": "Bronx"});
Query:
db.restaurants.find({"borough": "Bronx"}).skip(5).limit(5);
Query: