KEMBAR78
Large Scale and MultiStructured Databases | PDF | Apache Hadoop | Databases
0% found this document useful (0 votes)
369 views223 pages

Large Scale and MultiStructured Databases

Appunti Large Scale and MultiStructured Databases - Artificial Intelligence and Data Engineering Unipi

Uploaded by

Franco Terranova
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
369 views223 pages

Large Scale and MultiStructured Databases

Appunti Large Scale and MultiStructured Databases - Artificial Intelligence and Data Engineering Unipi

Uploaded by

Franco Terranova
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 223

Large Scale and Multi-Structured Databases

There are different sources of big data, starting from social media and networks, scientific
instruments, mobile devices, transactions, sensor technology and networks, and genomics.
These are the 5 most important Vs on Big Data:
- Volume: huge amount of data
- Variety: different formats of data from various sources
- Value: extract useful data
- Velocity: high speed of accumulation of data
- Veracity: inconsistencies and uncertainty in data

We have other Vs of Big Data, additional features:


- Validity: data quality
- Variability: which consists in dynamicity
- Venue: Distributed Heterogeneous data from multiple platforms
- Vocabulary: semantics that describes data structures
- Vagueness: confusion over meaning of big data and tools used

We have to take care of different aspects: storage infrastructure, computation infrastructure,


databases/querying, analytics/mining, security and privacy, data acquisition and visualization.
Big data refers to any problem characteristic that represents a challenge to process it with
traditional applications.
Big data involves data whose volume, diversity and complexity requires new techniques,
algorithms and analyses to fetch, to store and elaborate them.
New technologies are needed for both storage and computational tasks, New paradigms are
needed for these problems, like distributed systems.
To deal with data intensive applications we used to scale-up, moving to servers with bigger
capabilities, but this is too much expensive.
Another possibility was to scale-out, adding more servers in parallel so to build clusters.
Distributed systems in big data has to have like objective: applying an operation to all data.
One machine cannot process or store all data, but data is distributed in a cluster of computing
nodes.
It does not matter which machine executes the operation, and if it is run twice in different nodes.
We look for an abstraction of the complexity behind distributed systems.
Data locality is crucial, we need to avoid data transfer between machines as much as possible.
We need to let computers work as much as possible with its data or near data, splitting the data in
the right way or replicating data In a smart way.
The idea is to replicate data in different places, and let computers work in parallel. There will be
low computer intensity and limited communication, the only important thing is the result.
We have store data on local disks of the nodes that perform computations on that data (data
locality).
A new programming model is MapReduce: “Moving computation is cheaper than moving
computation and data at the same time”.
The idea is that data is distributed among nodes (distributed file system) and functions/operations
to process data are distributed to all the computing nodes.
Each computing node works with the data stored in it. One
A mapper is a program that works on a chunk of data, executed on a computer unit.
The reducer is a program executed in one or more computer unit that aggregate the results
produced by mappers.
It uses the divide et impera paradigm.
divide: partition dataset into smaller, independent chunks to be
processed in parallel (map)
conquer: combine, merge or otherwise aggregate the results from the previous step (reduce)

Input data is divided in logical splits, the mapper running on a PC will produce results and these
results will be re-organized in order to be re-elaborated by the reducer, that will aggregate them
and produce the final result:

The mapper takes as input couples of key/value and produces couples of key/value then all the
key values resulted by different mappers are sorted to generate key/value pairs where the value is
a list of values, each of them is elaborated by a specific reducer, that produces an output.
In this example the input key is the word and the value is the count, all key/value couples are
shuffled and sorted and the reducer counts the number of elements and then produces a result.
We can see the physical infrastructure of the Google Software Architecture for Big Data.

The distributed file system was the most important part and they used the map-reduce
architecture working with the framework Hadoop and BigTable.

Hadoop is an open-source (the first) framework of Google written in Java used for the distributed
storage of very large data sets (Big Data) and the distributed processing of very large data sets.
This framework consists of a number of modules:
- Hadoop Common: the common utilities that support the other Hadoop
modules.
- Hadoop Distributed File System (HDFS)
- Hadoop YARN – A framework for job scheduling and cluster resource management
- Hadoop MapReduce – programming model
The HDFS – is a distributed File System written in Java, it scales to clusters with thousands of
computing nodes.
Each node stores part of the data in the system and it is fault tolerant due to data replication.
It is designed for big files and low-cost hardware and efficient for read and append operations.
We have some limits, for example Hadoop is optimized for one-pass batch processing of on-disk
data. In interactions we have to push back the data from memory to the file system. A loop of
several map-reduce stages requires more operations with memory.
In iterative programs it is very slow.

Apache Spark is an open-source which has emerged as the next generation big data processing
tool due to its enhanced flexibility and efficiency.
Spark allows employing different distributed programming models, such as MapReduce and
Pregel, and has proved to perform faster than Hadoop, especially in case of iterative and online
applications.
Unlike the disk-based MapReduce paradigm supported by Hadoop, Spark employs the concept of
in-memory cluster computing, where datasets are cached in memory to reduce their access
latency.
At high level, a Spark application runs as a set of independent processes on the top of the dataset
distributed across the machines of the cluster and consists of one driver program and several
executors.
The driver program, hosted in the master machine, runs the user’s main function and distributes
operations on the cluster by sending several units of work, called tasks, to the executors.
Each executor, hosted in a slave machine, runs tasks in parallel and keeps data in memory or disk
storage across them.

The main abstraction provided by Spark is the resilient distributed dataset (RDD).
RDD is a fault-tolerant collection of elements partitioned across the machines of the cluster that
can be processed in parallel. These collections are resilient, because they can be rebuilt if a portion
of the dataset is lost.
The applications developed using the Spark framework are totally independent of the file system
or the database management system used for storing data.

Databases revolution
A database is an organized collection of data.
It can be a book with a enforced structure such as dictionaries or a library and other indexed
arcvhives of information. They can be seen as the first databases.
In the 19th century we had a mechanical database, loom card for creating complex fabric patterns.
We needed to store what we wanted to have in our fabric.
Another kind of databases was perforated paper strips for musical boxes, strips contained
information.
We also had punched cards used in tabulating machine for producing census statistics.
The zero revolution was the tape revolution. Data were stored sequentially adopting paper tape or
magnetic tape.
Features was ‘fast forward’ and ‘rewind’ through the datasets.
But the problem was no direct high-speed access to individual records.
The first electronic database, OLTP (Online Transaction Processing) appeared around 1950.
Data was stored in the file system, we had a database but without a DBMS, they had a ISAM
system that had records containing indexes.
A DBMS is a software in charge of managing the DB in term of querying, consistence and other
rules we have to follow when work with data.
We had no application in the data layer and all the control of the database was in the hand of the.
application. Every application had to reinvent the database wheel, with problems like errors in
handling code, corrupting data, concurrent access, duplication in each application.
The first database revolution was when we have a new layer that allowed the separation between
the database handling logic and the application itself, with DBMS.
It reduced the programmer overhead and increased the performance and integrity of data.
The DBMS need two elements to work correctly:
- a scheme for the data to store in the database
- an access path for navigating among data records
The navigation model has to do with the organization of the data.

In the hierarchical model the only way to get information was to start from a product.
We had to define the navigational model keeping in mind what queries we have to do.
The network model can be better in some cases.
If you start from a product all the time, the hierarchical model can be okay.
Problems of the Navigation Models were that they ran exclusively on the mainframe computers, in
only queries that could be anticipated during the initial design were possible, a hierarchical model
you can just start from the product and you cannot start from a department for example, and also
it was difficult to add new data. Also, complex analytic queries requires hard coding.
The second database revolution started few years later.
The demand for complex analytics were increasing, existing databases were hard to use and
existing databases mixed logical and physical implementations.
In 1970 a paper was published talking about the Relational theory with the concept of tuples, an
unordered set of attribute values, and relations, collections of distinct tuples.
There were also constraints to enforce consistency of the database and key constraints used to
identify tuples and relationships between tuples.
Operations on relations, such as joins, projections and unions, were introduced too, together with
the third normal form, that says that non-key attributes must be dependent only on the key.
Then, the concept of transactions was introduced too, which is a concurrent data change requests
on a database system.
In transactions we need to ensure consistency and integrity of data, and a solution are ACID
transactions.
A transaction is a transformation of the state which has the properties of atomicity (all or
nothing), durability (survive failures) and consistency (correct).
An ACID transaction should be:
- atomic, the transaction is indivisible.
- consistent, the DB remains in a consistent state before and after the execution of the
transaction.
- isolated, while multiple transactions can be executed simultaneously, one transaction
should not see the effects of other in-progress transactions.
- durable, once a transaction is saved to the DB, its changes are expected to persist even if
there is a failure of operating system or hardware.

If the transaction fails after completion of T1 but before completion of T2 we have an inconsistent
state.
Then the SQL language appeared and was adopted. as a standard query language and also, a war
started for DBs between IBM and ORACLE.
In the client-server paradigm the relational database had a lot of importance.
Database systems were hosted on remote servers, while presentation logic was hosted on a PC
terminal. Objected-oriented application had problems with relational databases.
Multiple SQL operations were required to convert from the object-oriented representation to the
relational representation.
A new DBMS model, namely Object Oriented Database Management System (OODBMS), allowed
to store program objects without normalization and applications were able to load and store their
object easily.
Object-Relational Mapping (ORM) frameworks alleviated the pain of OO programmers.
Internet started to pervade our lives and the era of massive web-scale applications created
pressures on the relational database model. They needed large volumes of read and write
operations, low latency response times and high availability.
The performance of RDBM may be improved with vertical scaling (upgrading CPUs, adding
memory and faster storage devices) but it is costly and there are limitations on the number of
CPUs and memory.
But in any case, it would still be difficult to modify the structure of existing data, difficulties in
deploying on multiple servers, performance issues and ACID transactions hard to implement on
multiple servers.
The third database revolution started with the needs of large-scalable DBMSs characterized by:
- Scalability: for example, a spike of traffic on a website.
We could add a new server and then shut it down when traffic become normal but hard to
adopt with RDBMS.
- Flexibility: for example, application which handles several kind of items.
We could add/remove/modify one or mode fields which describe items in the database.
But to modify the schemes of RDBMs is not an easy task and involves the modification also
of the application code.
We then move to a Schema-less model from a Schema model.
- Availability: for example websites such as Social Networks or e-commerce services
must be always available for users.
A solution could be to deploy the whole systems on multiple servers. If one server goes
down, the services continue to be available for users, even though performances (and also
data consistency) may be reduced.
Usually RDMSs run on a single server. If it fails, the services will not be available unless
there is a backup server. However, this solutions are not efficient because the second
server will not improve the performance of the DBMSs during its normal workflow.
We can’t exploit the replication and sharding and do a load balancing.
- Low cost: most of the RDBMSs used by enterprises and organizations have often licensing
costs. These costs depends on the number of users.
A solution might be to use open source DBMSs.
The issue is that enterprises and organizations should change their database architectures
and even re-design entirely their applications.

The third revolution led us to NoSQL Databases.


NoSQL is the acronym of Not Only SQL. NoSQL considers a set of data models and related
software.
Most of NoSQL solutions ensures Scalability, Availability, Flexibility and are often open source.
ACID transactions may be not supported by NoSQL databases.

So, to recap:
Currently, scalable services for handling relational databases are offered by big enterprises such as
Google and Amazon. Using these services, enterprises may continue to use their software without
the needs to migrate towards NoSQL databases. These services are cloud-based and pay-per-use
services.
The offered solutions are often scalable up to thousands of nodes and ensure high availability of
the service, but still the flexibility is reduced as in classical SQL-based DBMSs.
Everything is also managed on the cloud and also a spending manager is needed for taking the
costs under control.

Recap on Software Engineering


The steps of the workflow to develop, test and deploy one application include these steps:
1) Requirements elicitation from customers, which will regard some interviews between the
customer and the software engineer.
2) Requirements definition, both functional and non functional
3) Use case definition (better if with diagrams), to formalize single activities that will be in our
application.
4) Identification of the main data structures (including main procedure) and relations among
them (Analysis Workflow/Data Modelling)
5) Refinement of Analysis classes in step 4 with the Project Workflow, in which we add more
details to classes. Then we will move towards the DB design.
6) Implementation and Test
7) Deploy
This workflow is not currently used in a strict way, most enterprises today use the agile approach.
We need to identify the software/hardware architecture, programming languages and
development environments, database management systems and everything must be driven by
requirements, common sense and experience.

Functional requirements describe the main functionalities of the application and modules and they
depend on the type of software, expected users and the type of system where the software is
used.
Functional requirements may be high-level statements of what the system should do.
Functional requirements can define if/then behaviours.
EX. The user shall be able to search either all of the initial set of databases or select a subset from
it. The system shall provide appropriate viewers for the user to read documents in the document
store.
Specifying some fields in these requirements is not an error, but all of them yes.

Non-functional requirements define system properties and constraints (e.g. reliability, response
time and storage requirements).
Ex. At least concurrently 100 years, at least it should have xMB/s in download, access should be
regarded only to authorized users
They may regards also programming language or development method. Non-functional
requirements may be more critical than functional requirements. If these are not met, the system
is useless.
Ex. In this application we decided to use C#.
A categorization of non functional requirements can be this one.
- Product requirements: requirements which specify that the delivered product must behave
in a particular way (execution speed, reliability, ecc.)
- Organizational requirements: consequences of organizational policies and procedures
- External requirements: which arise from factors which are external to the system and its
development process (e.g. interoperability and legislative) requirements

Use cases represent a formal way of describing how a system interact with its environment. A use
case illustrates the activities that are performed by the users of the systems.
Use cases are sequence of actions a system performs that yields a valuable result for a particular
actor.
This is a scenario-based technique in UML.
You can add mookup at the use case diagram.
Use case diagrams describe what a system does from the standpoint of an external observer, the
emphasis is on what a system does rather than how.
They are closely connected to scenarios, which are example of what happens when someone
interacts with the application.

A user or outside system that interacts with the system being designed, in order to obtain some
value from that interaction is an actor.
Use cases describe the interaction between the actor and the system.
It can be a human, a peripheral device, external system or subsystem, time or time-based event.
A base use case is dependent on the included use case. Without it/them the base use case is
incomplete as the included use case represent sub-sequences of the interaction that must always
happen.

Whenever we make an ATM transaction, we also have to run the Customer Authentication action.
Otherwise, it cannot be executed. For example, Customer Authentication -> Authentication Check.
The extending use case is dependent on the base use case; it extends the behavior described by
the base use case. The base use case should be a fully functional use case without the extending
use case’s adding functionality.

You can’t make help without doing the Bank ATM Transaction. It is an additional feature that you
have.

Data models provide a structure and formal description of the data and data relationships among
them required for an information system.
EX. Entities: Company, Customer, Departments
Relations: Company has a collection of departments
Entity Sets and Relationship Sets do not depend on the type of Data Base and DBMS.
The overall data modelling process includes three stages:
- Data analysis
- Designing a conceptual data model (entity relationship model)
- Converting the conceptual data model into an actual DB data organization model
We will model this using UML analysis classes.

A department has a collection of employes. The order in which they’re added is the reverse of the
E/R diagram.
We have a generalization, we can assume that employee is an abstract class, which can not be
instantiated, and from it we can obtain some generalized classes.
In a relational model we would need different tables.

In the graph-based model we have vertices which are entities and edges with some properties.
In particular, they are records, and not general entities.
We found these actors: Manager
Selected it is specified to make understand that first a department has to be selected.
We resemble the manager as a man.

We imagine to have a view in the application which is the list of departments.


It has a use case which is browse departments.
Another use case is find department, and I would expect details from that department, that I can
draw next to the list of departments.
We need also the view Department use case, and it will be included by the find department use
case.
In browse departments we would find for all the departments in the database, and show the view
of all departments, so we can implement it with an include.
The delete department is an extension of the view because we have first to select it, so we have to
go to the view of that department.
For our course we will avoid to add delete attached to the manager because it’s really high-level.
Also modify the department is an extension.
If you have an extension it’s an additive feature, you can run it or not, in difference of the inclusion

Add can’t be an extension of browse, you don’t care about browsing, it’s not important to run the
browsing. Users do not browse before adding, even if it is part of the mockup.
So, it’s a direct action, connected directly to the actor.
If everytime we add a department we want to see a department we need to add an inclusion from
add to view. In that case it will be directly showed. (add is not shown in the diagram)
If we want to show an acknowledgement we could include add department -> show ack.
An extension of find department can be set parameters, but this advised in case we have default
parameters.
Once a department has been selected, a manager can browse the list of its employees.
Once an employee has been selected, a manager can browse the list of its project, so we have
another extension.
Once an employee has been selected, a manager can assign a project.
Once we view a project, we can show its details.

Putting inside an entity the id of the entity with which it has a 1-* relation is wrong, it’s already
specified by the relation in the diagram.
We can avoid attributes and specify them with a relation to an entity features (relation 1-*) and
use generalizations or even use different relationships with a lot of entities which represents the
features with 1-1 relations.
ACID vs BASE
Databases must allow users to store and retrieve data. To this aim, three tasks must be in charge
to DBMSs:
- Store data persistently
- Maintain data consistency
- Ensure data availability
Most of the recent NoSQL DBMSs can be deployed and used on distributed systems, namely on
multiple servers rather than a single machine.

They can make different copies of data that we can exploit.


Creating distributed systems have different pros.
- Ensure scalability, flexibility, cost control and availability.
- It is easier to add or to remove nodes (horizontal scalability) rather than to add memory or
to upgrade the CPUs of a single server (vertical scalability).
- Allow the implementation of fault tolerance strategies. If one server goes down we can ask
to another server to work and do its job.
Accomplish with the motivations that led to the third databases revolution.
However, there are some cons on using distributed systems.
- To balance the requirements of data consistency and system availability. We must replicate
data and create sharks. A write must be replicated in all servers, and this will create
latency.
- To protect themselves from network failures that may leave some nodes isolated.

Data persistency consists in being sure that data must be stored in a way that is not lost when the
database server is shut down, or for example in case of failure.

Persistency can be achieved in different ways.


Data consistency: data must be correct before and after a transaction.

For example, a financial database and customers that need to add money to the company. We
need to decrease the customer balance and increase the total fund available. In order that Bob to
use the money correctly he has to wait a consistent data.

Data Availability: Data stored in a single server DBMS may be not available for several reasons,
such as failures of the operating system, voltage drops, disks break down. In the figure we show a
possible solution that ensures a high data availability: the two-phase commit. We may think to
have a backup server.
If the primary server goes down, the Backup server takes it places and the DBMS continues to
offer its services to the users.

Into an operation we have to write to a primary server, write the copy to a backup server, wait for
the acknowledgment from the backup server and then wait for the finish ack from the primary
server.
We have to wait more but it’s needed if we want the fully consistency. We also have the network
overhead.
You can have consistent data, and you can have a high-availability database, but transactions will
require longer times to execute than if you did not have those requirements.
Everything depends on functional and non-functional requirements.

An ACID transaction should be:


- Atomic: The transaction is indivisible—either all the statements in the transaction are
applied to the database or none are.
- Consistent: The database remains in a consistent state before and after transaction
execution.
- Isolated: While multiple transactions can be executed by one or more users
simultaneously, one transaction should not see the effects of other in- progress
transactions.
- Durable: Once a transaction is saved to the database, its changes are expected to persist
even if there is a failure of operating system or hardware.
There are some applications in which is not acceptable waiting too much time for having
concurrently consistent data and highly available systems. In this kind of applications, the
availability of the system and its fast response is more important than having consistent data on
the different servers.
There might be a period of time where copies of data have different values, but eventually all
copies will have the same value.

NoSQL databases often implement eventual consistency.


If our main interest is availability we can use two servers, a primary and a secondary, and when I
make a write operation we can make an operation to the server and finish the write and return
the response to consumer
Then there might be a thread that copy the data to the second server while the customer do its
things.
There might be a period of time where copies of data have different values, but eventually all
copies will have the same value.

We can also have available but not consistent.


In the e-commerce scenario the shopping cart may have a backup copy of the cart data that is out
of sync with the primary copy.
The data would still be available if the primary server failed.
The data on the backup server would be inconsistent with data on the primary server if the
primary server failed prior to updating the backup server, but it might not be a big problem.

We can have consistent but not available, we have to wait to complete the multi-phase commit.

When dealing with two-phase commits we can have consistency but at the risk of the most recent
data not being available for a brief period of time.
While the two-phase commit is executing, other queries to the data are blocked, the server will
not be available.
The updated data is unavailable until the two-phase commit finishes. This favors consistency over
availability.
We have a trade-off between availability and consistency.

In case of clusters of server there might be a situation of a network partition.


When the network is working a query might be answered from a server of one side or another. We
need also to update data everywhere.
If we have this network partition the system has two choices: either show each user a different
view of the data (availability but not consistency), users may have inconsistent views. If user A
update data, user B cannot see the update.
The other option is to shut down one of the partitions and disconnect one of the users
(consistency but not availability). A user will live an absence of availability while user A will have
consistency.

All these things are summarized in the CAP theorem (Brewer’s Theorem), the unique theorem in
distributed systems.
Distributed Databases cannot ensure at the same time:
Consistency (C), the presence of consistent copies of data on different servers
Availability (A), namely to providing a response to any query
Partition protection (P), Failures of individual nodes or connections between nodes do not impact
the system as a whole.
At maximum two of the previous features may be found in a distributed database.
This triangles have on vertices three features and on the sides we have different situations we
have when two features are present.

We can have different CA solutions.


Site cluster, therefore all nodes are always in contact, when a partition occurs, the system blocks.
We choose C and A with compromising of P.
Use cases: Banking and Finance application, system which must have transaction e.g. connected to
RDBMS.
Total consistency can affect performance (latency) and scalability, we have the two-phase commit.

We can have AP solutions.


System is still available under partitioning, but some of the data returned may be inaccurate.
We choose A and P with compromising of C.
Notice that this solution may return the most recent version of the data you have, which could be
stale. Indeed, this system state will also accept writes that can be processed later when the
partition is resolved.
Availability is also a compelling option when the system needs to continue to function despite
external errors.
Use cases: shopping carts, News publishing CMS, etc.
We usually have low latency; we don’t have two-phase commit but we lose consistency.

We can have CP solutions.


As suitable for application which require consistency, but also partition tolerance, while somewhat
long response times are acceptable (Bank ATMs).
They are usually based on distributed and replicated relational or NoSQL systems supporting CP.
In most of recent NoSQL frameworks the availability level can be setup with different parameters.
Choose based on the requirement analysis.

This decision is strictly related to function but mostly non-functional requirements.

BASE Properties of NoSQL databases


BA stands for basically available: partial failures of the distributed database may be handled in
order to ensure the availability of the service (often thanks to data replication).
S stands for soft state: data stored in the nodes may be updated with more recent data because of
the eventual consistency model (no user writes may be responsible of the updating!!).
E stands for eventually consistent: at some point in the future, data in all nodes will converge to a
consistent state.

We have ACID transactions that are too ACID for new large-scale application, so we will move
towards BASE.
Let’s talk about the latency issue.
The level of consistency may be tunable. We have different types of eventual consistency:
- Read-Your-Writes Consistency ensures that once a user has updated a record, all of his/her
reads of that record will return the updated value. He will read the correct value but not
other users.
Let’s say Alice updates a customer’s outstanding balance to $1,500.
The update is written to one server and the replication process begins updating other
copies.
During the replication process, Alice queries the customer’s balance. She is guaranteed to
see $1,500 when the database supports read- your-writes consistency.

- Session consistency ensures “read-your-writes consistency” during a session. If the user


ends a session and starts another session with the same DBMS, there is no guarantee the
server will “remember” the writes made by the user in the previous session. Eventually the
data will be updated, we need to consider this.

- Monotonic read consistency ensures that if a user issues a query and sees a result, all the
users will never see an earlier version of the value. If you get a result, another user that
makes the same result cannot receive an earlier result.
Let’s assume Alice is yet again updating a customer’s outstanding balance. The outstanding
balance is currently $1,500. She updates it to $2,500.
Bob queries the database for the customer’s balance and sees that it is $2,500.
If Bob issues the query again, he will see the balance is $2,500 even if all the servers with
copies of that customer’s outstanding balance have not updated to the latest value.

- Monotonic write consistency ensures that if a user makes several update commands, they
will be executed in the order he/she issued them.
Alice is feeling generous today and decides to reduce all customers’ outstanding balances
by 10%. Charlie, one of her customers, has a $1,000 outstanding balance. After the
reduction, Charlie would have a $900 balance. Now imagine if Alice continues to process
orders. Charlie has just ordered $1,100 worth of material. His outstanding balance is now
the sum of the previous outstanding balance ($900) and the amount of the new order
($1,100), namely $2,000. Now consider what would happen if the database performed
Alice’s operations in a different order. Charlie started with a $1,000 outstanding balance.
Next, instead of having the discount applied, his record was first updated with the new
order ($1,100). His outstanding balance becomes $2,100. Now, the 10% discount operation
is executed and his outstanding balance is set to $2,100– $210, namely $1890 (instead of
$2,000).

- Casual consistency says that if an operation logically depends on a preceding operation,


there is a causal relationship between the operations. Causal consistency ensures that
users will observe results that are consistent with the causal relationships.

In the first box we have the timeline with the right order.
If we want to ensure casual consistency, we may accept a view order like the second box.
We cannot accept a view like the third box, we cannot have mike’s posts commented by mike
himself.
We have that comments on mike that have casual consistency on Pietro’s and Luca’s posts.
The comment is logically depending on the post.
I may have a different order of the comments but answers to the comment must be viewed under
the comment itself because it depends on the comment.

Key-value Databases
An array is an ordered list of values, where each value In the array is associated with an integer
index. The values are all the same type.
We have some limitations:
- The index can only be an integer
- The values must all have the same type
Associative arrays are collections of key/value elements that generalize the idea of an ordered list.
The key has the same role of indexes in array but can be a generic identifier (string, integer, ecc.)
The values can be of different types.

Key/value databases are associated to associative arrays, but data are persistently stored in the
disk. There is no need for tables if we do not need groups of related attributes.
The unique requirement is that each value has a unique identifier in the form of key.
We usually define a namespace, the bucket, collection of key/value elements.
Keys must be unique within the namespace defined in each specific bucket.

Developers tend to use key/value databases when ease of storage and retrieval are more
important than organizing data into more complex data structures, such as tables or networks.
Essential features of key/value databases are:
- Simplicity
It’s useful if we do not need all features of the relational models, if we do not have to
define a database schema nor to define data types for each attribute.
We can simply modify the code of our program if we need to handle new attributes,
without the necessity of modifying the databases.
They are also flexible and forgiving, we can make mistakes assigning a wrong type of data
to an attribute or use different types of data for a specific attribute. It will not be a
problem, we don’t get any error of the DBMS, in fact we do not define the type. But we do
not have the check we may have in relation databases.
- Speed
To reduce latency the best way is to have data in the disk.
The key/value databases can write the recently updated value to disk, while the user
program is doing something else.
Operations may be stored first in the cache and then, after returned the control to the
application, the disk can be update.

Read operations can be faster if data is stored in memory. This is the motivation for using a
cache: the first time the program fetches the data, it will need to read from the disk but
after that the results can be saved in memory.
Next time data can be searched in cache and then if not found, in the disk.
We have to consider that RAM is not infinite.
When the key/value database uses all the RAM memory allocated to it, the database will
need to free some records.
Several algorithms exist for selecting the records to remove from the RAM.
The least recently used (LRU) is one of the most used algorithms.
Data that has not been read or written as recently as other data can be removed from the
cache. We are removing in this way data from cache, not from the database.
Use case: key-value DB used to store items in customers’ online carts.
In that case if we have an item inside the memory representing the cart of one user that is
not visiting the site for a while, that data will be removed.
It will always find that when he’ll come back, it’s still in the DB, but we remove the data
from the cache.
- Scalability
The scalability is the capability to add or remove servers from a cluster of servers as
needed to accommodate the load on the system.
These databases can be distributed.
When dealing with NoSQL databases, employed in web and other large-scale applications,
it particularly important to correctly handle both reads and writes when the system has to
be scaled, we need to estimate the amount of reads and writes operations and depending
on their balance we will need to decide the architecture to use.
Dividing in chunks a table of a relational database and know where to deploy them is not
easy.
In the framework of key/value database, two main approached are adopted, namely
master-slave replication and masterless replication.

The master-slave replication is an architecture in which we have a master, that is a server that
accepts write and read requests. This is the only server that can accept write operations.
The master is in charge of collecting all the modification of the data and updating data to all the
slaves. It will propagate all the updates results to other server. Everything about consistency is
managed by the master.
The slaves may only respond to read requests.
This architecture is especially suitable for applications with a growing demand for read operations
instead of write operations.
Pros:
- Very simple architecture: only the master communicates with other server
- There is no need to coordinate write operations or resolve conflicts between multiple
servers accepting writes
Cons:
- If the master fails, the cluster cannot accept writes.
- In this case, the entire system fails or at least loses a critical capacity, accepting writes.
It depends on the application if it’s possible to wait or not.
We may add more complexity in this architecture increasing the availability.
Each single slave server may send a simple message to ask a random server in the cluster if it is still
active, asking other slaves for how long they are receiving messages from the master.
If several slave servers do not receive a message from the master within some period, the slaves
may determine the master has failed. The slaves initiate a protocol to promote one of the slaves as
a new master.

It may be hard to handle the load because we lost one server, but we improve the availability
increasing the complexity.
The master-slave replication architecture will suffer an increasing number of write requests (only
the master accept writes and all the updated must be sent to the slaves).
Use case: applications in which the number of writes may increase and achieve peaks.
The solution is to use the masterless replication, a replication model in a cluster of servers where
all nodes may accept both read/write requests.
In the picture we see a logical view of a possible architecture, that means that one server can
communicate with each server in its side.

In this architecture there is not a single server that has the master copy of updated data.
There is not a single server that can copy its data to all other servers.
Each node is in charge of helping its neighbors.

Let’s see an example.


Each time there is a write operation a server can answer, and it may replicate that change to the
three other servers holding its replica (we consider 4 data replicas). Data are replicated 4 times.
Each server replicates to its two neighbors and to the server two links ahead.

The ring structure is an abstraction. Usually, in data center, all the servers are connected to a
single network hub and are able to communicate with each other.
Let’s suppose to have a dataset divided in 8 chunks and each server can host for replicas, and it’s
in charge to update its data and data of other 3 servers.
If I write on partition 1, it may carry out the update to 2, 7 and 8 and so on, server 2 may update
the chunk in server 3. If server 2 answer it may update the chunks on its neighbors.
Different servers can answer to queries of write to a specific chunk.

How to construct a key


In relational model, the primary key is used to retrieve a row in a table and it must be unique.
Similarly, in a key-value database a key is used to retrieve a value and must be unique in a specific
namespace.
When designing relational databases, using meaningless primary key is a good practice and
primary keys are also immutable. Moreover, we would have to update all references to that key in
other tables. Thus, it’s important to select PK without a specific meaning.
In the framework of key-value databases, adopting meaningful keys is preferable; the key is the
unique mean to retrieve a value.
In key/value databases we just have key/value tuples and no more.
In key-value DBs there are no tables, nor columns, nor pre-defined attributes.
A specific value is associated only to its key.

Let’s suppose to manage a cart application using a key-value database.


If we use meaningless keys, we access the cart as follows:
Cart[12387] = “SK AK231312”.
We use the id of the user.
We may define more interpretable namespace using:
CustName[12387] = “Pietro Ducange”
Now we have just the information on the customer’s name.
We need to define additional namespaces for each attribute, thus moving back to the relation
models, defining a schema.
We can build keys that embed information regarding Entity Name, Entity Identifier and Entity
Attributes.
Let’s suppose we have a customer, and we want to manage this entity.
If we have different attributes and we want to translate this in our key-value database we have to
insert a row for each attribute.
While in a relation DB I would add just a row with all attributes for a specific customer I would
have to insert one value associated with a specific key for each value of the attribute.
For each key, we may store in the database a specific value.

We can use keys that start with customer, then I would have the ID of the customer and the name
of attribute.
For handling customer information, we can just use one namespace.
If I want to add an attribute I can do without problem.
I would not occupy space for null values, I would skip an attribute for a customer.
In the same namespace I would be able to put other entities, for example the orders of the
customer.
The general rule for defining keys is the following (single Entity, no relationships):
Entity Name: Entity ID: Entity Attribute
I would have to add a key/value couple for each attribute
One attribute can also be a complex attribute, for example a JSON complex file, and there will be
other problems to handle, deciding if it’s the case to use key/value databases.
If you have all JSON in your database, it would be better to use a document database.

In general, keys may be strings, numbers, list or other complex structures.


In order to identify a value in the DB, we need the address of the specific location in memory or in
the disk in which this value is stored.
Hash functions are usually used to obtain the address. The has functions return a number, from an
arbitrary key.
Usually, hash functions return values that seem to be random values. Values returned by hash
functions may not be unique (collision problems).
This is an example of hash mapping:
Hashing is also very important because can help us to balance the load on different servers.
We would like to balance the write loads among the swervers in a masterless replication
architecture.

We can use an hash function, the modulus, that gives us the server that answer to a query.
If we divide the hash value (the result) by the number of servers, we can collect the reminder,
namely the modulus, that can be used to identify the server.
The modulo span from 0 to 7.
Each of the 8 servers can be assigned a number (the address).
We generate a sort of load balancing.
Problem: we can use it to avoid selling tickets to the same seat, at the same venue, in the same
city, on the same night to more than one person.

Two fans want to purchase the same ticket, they will if we don’t put attention.
We can use the following key for the specific ticket:

Anyone trying to purchase that same seat on the same night would generate the same key, that
will be transformed by the hashing + modulo function in the same server address.
There is no chance for another server to sell that seat.
One client would have to wait, and this situation could be avoided.

Key/value databases do not expect us to specify types for the values we want to store.
Examples for storing the address of a customer:
Some considerations:
- Consider the design characteristics of the key/value DB we choose to use
- Consult the documentation for limitation on keys and values
- When choosing a key/value DB consider the trade-off of various features
- One key/value DB might offer ACID transactions but set limit on the dimension on keys and
values
- Another key-value data store might allow for large values but limit keys to number or
strings
- Our application requirements should be considered when weighing the advantages and
disadvantaged of different DB systems

The operations allowed in a key-value databases:


- To retrieve a value by key
- To set a value by key
- To delete values by key
If we want to retrieve all people living in a place, we will have to do with an application program,
but we do not have the separation between the DBMS and the application.

If we have a key for a customer describing the address:

We could define a function that would return all addresses with a certain city stating the interval
of IDs of users.

If it’s too complex I made an error on choosing this kind of DB.

Some key-value databases incorporate search functionality directly into the database, indexes.
A built-in search system would index the string values stored in the database and create an index
for rapid retrieval.
The search system keeps a list of words with the keys of each key-value pair in which that word
appears.
It will be a problem in writing because we need to update this data structure but for reading this is
a very good solution and also, we need additional memory to reserve for indexes.

In the key we can specify details regarding the entity and the attribute.

In general, keys may be strings, numbers, list or other complex structures, such as JSON files.
In order to identify a value in the database, we need the “address” of the specific location in which
this value is stored. Hash functions as usually used to obtain the address, namely a number, from
an arbitrary key.
Usually, hash functions return values that seem to be random values.
Values returned by hash functions may be not unique (collision problems!!), we will have to
manage these collisions.
They can be used do distribute the load between different servers in a cluster.
At a minimum, a key is specified as a string of characters: Patient:A384J:Allergies
Strings representing keys should not be too long, or we’ll occupy a lot of memory.
We may encoure the risk of saturating the main memory.
Long keys will use more memory and key-value databases tend to be memory-intensive systems
already.
At the same time, avoid keys that are too short. Short keys are more likely to lead to conflicts in
key names.

A value is an object, so at low level a set of bytes, that has been associated with a key. Values can
be integers, floating-point numbers, strings of characters and even complex objects such as
picture, video, and JSON files.
Key-value implementations will vary in the types of operations supported on values.
Limits on the dimension of a single value may be fixed by the different frameworks for Key-Value
databases, some may have limits on the dimension of the value.

A name space is a collection of key-value pairs that has no duplicate keys.

In some application we may need different modules in which we have the same attribute’s name
but with a different meaning.
ES. Name in social network is the nickname of the participant, and another module of the
application has the name for anagrafic purposes. It would be useful to use separate namespaces.
It is allowed to have duplicate values in a namespace.
Namespaces enable duplicate keys to exist without causing
conflicts by maintaining separate collections of keys.
Namespaces are helpful when multiple applications use the
same key-value database.
Namespaces allows to organize data into subunits.
For each namespace below it will generate a new name with a
prefix, which is the name of the namespace, but it will be
transparent for the developer.

Data Partitioning
Subsets of data (partitions or shards) may be handled by the different nodes of a cluster. A cluster
may contain more than one partition.
Several strategies exists for data partitioning.
The main objective of partitioning data would be to evenly balance, both write and read loads,
among servers.
If needed, additional nodes would be easily added to the cluster and data appropriately relocated.
An idea is to divide keys starting with ‘A’-‘L’ to the first server and the rest for the second. It can be
useful to balance the work, we’re not sure it’s right an half but it’s an idea.
One particular attention will be to relocate data when we add/remove a server.

A partition key identifies the specific partition in which the value has been stored.
Any key in a key-value database is used as a partition key.
In the previous example, the first letter of a key (string) acts as the value of partition key.
Usually, hashing functions are adopted for identifying the specific cluster or partition in which data
should be stored.

We are not required to define all the keys and types of values we will use prior to adding them to
the database, this schema-less property determines the flexibility of our database.
We may decide to change how storing the attributes of a specific entity.
Regarding the example in the table, we might decide that storing a customer’s full name in a single
value is a bad idea, thus we will separate first and last names.
We need to update the application code to handle both ways of representing customer names or
convert all instances of one form into the other or take care about integrity, references. But now
we don’t have these problems, but in contrast we need to pay more attention on the code.

In Key-Value databases, clusters tend to be loosely coupled.


This means that the server are fairly independent and complete many functions on their own with
minimal coordination with other servers in the cluster.
Each server is responsible for the operations on its own partitions and routinely send messages to
each other to indicate they are still functioning.
When a node fails, the other nodes in the cluster can respond by taking over the work of that
node.

In a masterless architecture we can exploit the logical ring organization of servers in which servers
can communicate with their neighbors.

Let consider a hashing function that generates a number from a key and calculates the modulo.
Whenever a piece of data is written to a server, it is also written to the two servers linked to the
original server (high availability). Also, we have load balancing.
If Server 4 fails, both Server 3 and Server 5 could respond to read/write requests for the data on
Server 4. When Server 4 is back online, Servers 3 and 5 can update it.

High availability is ensured by using replication, namely saving multiple copies of the data in the
nodes of a cluster. The number of data replicas is often a parameter to set.
The higher number of replicas, the less likely we will lose data, the lower the performance of the
systems (in terms of response time).
The lower the number of replicas, the better the performance of the systems, the higher the
probability of losing data, which is the probability of all servers to fail.
A low number of replicas may be used whenever data is easily regenerated and reloaded. If we
must regenerate data, we may prefer to reduce the number of replicas.

Most frameworks who exploit NoSQL databases give us the opportunity to set:
N = number of replicas
W = number of copies to be written before the write can complete
R =number of copies to be read for reading a data record, before assuming that the reading is
correct
In the first situation each time we read we want to maintain a strong consistency thus before
returning the control to the application we want to update all copies.
If we read just once, we are sure that we read consistent data in this way, though.
We have the highest level of consistency, the slowest write operations, the fastest read
operations.
In the second case we still maintain a good level of consistency because if writing, before returning
the control, at least we must update one replica.
If we read just once we are not sure that we get the last updated value, but if we leave two times,
we are sure that one copy will be the most recent (we may have timestamp associated), we have a
bit faster write but slower read operation but still providing consistency.
We prefer consistency rather than the availability, relating to the latency,

In the third case we write just once, fastest write, but in order to ensure consistency we have to
read N times, slowest reading.
The fastest situation with just 1 write and 1 read give us the highest availability, the lowest latency
but we may have problems with consistency.
What to choose depends on the context, we must decide the key but also the setup for replicas
and sharding.
For sharding we may use geographical approach coding in the key the country, or we may use
other strategies.
An example of sharding function, given a key, can translate it to a hash value, which in this case is
a number in hexadecimal format.

One of the important characteristics of hash algorithms is that even small changes in the input can
lead to large changes in the output.
Hash functions are generally designed to distribute inputs evenly over the set of all possible
outputs. The output space can be quite large. This is especially useful when hashing keys.
No matter how similar your keys are, they are evenly distributed across the range of possible
output values.

Assume we have a cluster of 16 nodes and each node is responsible for one partition.
The key 'cust:8983:firstName' has a hash value of
4b2cf78c7ed41fe19625d5f4e5e3eab20b064c24
and would be assigned to partition 4.
The key 'cust:8983:lastName' has a hash value of c0017bec2624f736b774efdc61c97f79446fc74f
and would be assigned to node 12 (c is the hexadecimal digit for the base-10 number 12).
Is it a problem? Depends. We can put together firstName and lastName to be sure that we
retrieve both in the same server.

To resolve the collision problem, in which hashing function can have the same output, we can use
a particular collision resolution strategy.
From a logical point of view, the table that projects a hashed key to the corresponding value may
include a list of values.
In the list we append all the values associated with the key.
In each block of the list, also the original key must be present.
We may then scan a list looking for the key in order to retrieve the value. We add complexity but
we have different advantages.

Use Case: adding or removing a node, even for a short time period.
Problem: we need to change the hashing function and to re-locate all the data among the servers
of the cluster.
If we change the number of server if we use the general hashing function, we need to change it, at
least the modulo, we need to increase it or decrease it.
We also need to relocate all records in the database, and we need to stop servers to relocate
everything.
Solution: to exploit the ring structure and a consistent hashing function that allows us to remap
only the keys mapped to the neighbors of the new node.
We have 4 nodes and, in this ring, the portion in the circular array organization are the keys for
each node.

We need to work only on keys on node 1 and 4, and we can keep use the same hashing function.
The same hashing function is applied to both the keys and the server/partition ID/Address/Name.
The hash values must be in the same range, for example hexadecimal on 32-bit representation, in
this case the result will vary from 0 to 2^32.
We will have a distribution of keys in the circular array, but we can also calculate the hashing value
of the id of the server.
A specific server will host all keys which precede it.
The actual server (Node) kj associated to a specific key (object) oi is its successor in the
hashing/address space. We will consider the first server we find in the ring.
If we remove server 2 and add server 4, we must relocate O18 on K3, the first successor, and
relocate O15 in K4.
The hashing function never changes, and we relocate only two objects.
The higher the range, the lower is the probability to have collisions but the more complex will be
the data structure to manage the hashing.
We use the hashing to reload the balancing on the server, but if we change server this may be a
solution.

Data Compression for Key-Value databases


Key/value databases can be exploited as caches to retrieve data from memory, but memory is not
infinite, key-value databases are memory intensive.
Operating systems can exploit virtual memory management, but that entails writing data to disk or
flash storage. Reading from and writing to disk is significantly slower than reading from RAM
memory. One way to optimize memory and persistent storage is to use data compression
techniques. Look for compression algorithms that ensure a trade-off between the speed of
compression/decompression and the size of the compressed data.

If data organization and management is more important than the performances, classical
relational databases are more suitable rather than key-value databases.
However, if we are more interested to the performances (high availability, short response time,
etc.) and/or the data model is not too much complicated (no hierarchical organization, flexibility,
simplicity, limited number of relationships) we may use key-values databases, even together with
other databases.
Indeed, key-value stores are really simple and easy to handle, data can be modeled in a less
complicated manner than in RDBMS.
Let’s consider this data structure:

In a relational model we can define the following tables. The one-to-many


relationship is handled by using a foreign key.

We need to join tables to make this operation and join operations are time consuming.
Now, we want to translate the data modelling from a relational model to a key-value model.
Take in mind that keys embed information regarding Entity Name, Entity Identifier and Entity
Attributes. Thus, we can translate the Employees table as follows:

For each record in the table, we need to create 3 key/value entities.


Everything is codified in the key.
In the same namespace we can create the collection of key/value for the 1:N relation.
As further step, we must translate the Payment table and to manage the one-to-many
relationship.
In this case, we can define the following key-value configuration:
payment:$payment_id:$employee_id:$attribute_name = $value
The Payment table con be translated as follows:
For each couple we will have two entities in our key/value database.
At the end of the translation process, data will be organized in a unique bucket as follows:

It’s easier to make replica and sharding with this table.


If we need to have relations with a more complicated relation than 1:N it’s better if we not use this
implementation of database.

JDBC and Maven


Our application may need to access and manipulate data stored in a DBMS.
Some difficulties may appear (Impedence Mismatch), depending on:
- Different Access Modes: SQL is a set-oriented language shile applications operate on a
single tuple at a time.
- Different Types of Data: each programming language defines its own types of data, which
may differ from those of SQL. Es: Java defines only one type to represent the strings
(java.lang.String), SQL defines several: CHAR, VARCHAR, TEXT, LONGTEXT,
In case we try to map the concepts of the DB with in object-oriented languages there are also
other difficulties called Object-Relational Impedance Mismatch.

DBMSs provide specific connectors (drivers) for certain programming languages.


Connectors are characterized by a low portability: each DBMS has its own API, so the application
becomes dependent on the specific DBMS.
In 1992 Microsoft introduced ODBC (Open DataBase Connectivity), which defines a unified API for
accessing databases in C.
Java introduces JDBC (Java DataBase Connectivity): it is almost equivalent to ODBC, but provides a
Java-specific API (java.sql.*)
In the JDBC architecture we have:

The DriverManager choose the right JDBC driver.


The class DriverManager (java.sql.DriverManager) manages the JDBC drivers and creates
connections to the DBMS.
A connection string is required to obtain a connection, in form of URN (Uniform Resource Name).
The string is specific for DBMS and indicates:
• The type of driver
• The type of DBMS
• The host of the DBMS server
• Access credentials
• The database to which you can connect
Accessing the DB:
Performing queries:

We can use also use for example System.out.format(“%s %s is %d”, firstName, LastName, age),
instead of getString.
If we define objects in the try block they will be automatically closed.
Using the interface java.sql.Statement the SQL statement is compiled at the execution time.
Using prepared statements allows us to pre-fill a statement and improves performance when the
same query is executed multiple times, even with different parameters.
The parameters are indicated by question marks and set with appropriate setters.
INSERT, UPDATE and DELETE statements do not return a ResultSet object.
The executeUpdate() method executes the statement and returns the number of rows actually
inserted/modified/deleted.

We can also make transactions in JDBC, which are set of operations that are atomic, either all of
them or none are executed. For example we want to update salaries in an atomic way.
We can disable autoCommit for executing the two commands before and then commit.
If the commit raises an exception, we would catch it and in case the problem is not in the
connection we would try to rollback.
We can also exploit a safe point so that when we want to rollback, we can specify the safe point.
This can be useful if operations are complicated.
Maven
“Apache Maven is a software project management and comprehension tool. Based on the
concept of a project object model (POM).
Maven can manage a project's build, reporting and documentation from a central piece of
information.”
We will use Maven for managing the dependencies in our JAVA programs.
Roughly speaking, we will avoid to manually add libraries and API.

A POM file is an XML external file that contains information about the project and configuration
details used by Maven to build the project.
It contains default values for most projects.
Whenever we need a new dependency, we add new fields to the dependencies list.
groupId uniquely identifies a project across all projects.
artifactId is the name of the jar without version.
By default, Maven will download from the central repository.

In IntelliJ the project information is inserted automatically.


These can be used by other programs to install my program.
Creating a Maven Project (Using IntelliJ)
1. Create a New Maven Project
2. Insert a name for the project
3. Choose between a simple project or from an archtype
4. Specify GroupID, artifactID and version
5. Add dependencies and right-click > Maven > Update
OR
1. Create a New Java Project
2. Insert a name for the project
3. Choose between a simple project or from an archtype
4. Right click on the project
5. Click on "Add Framework support..."
6. Scroll down and select "Maven"
7. Modify the pom.xml specifying the GroupID, artifactID and version
8. Add dependencies and right-click > Maven > Update
It is possible to convert a Java project into a Maven one.
Whenever we add a dependency or change the version we need to Reload the Project.
Solution of Exercise 1 - Book Shop Application:
The application must be used by the employees of a book shop, mainly for granting a complete
and efficient tracing of the books stored in the book shop.
For each book, the following information are stored: title, author, publisher, price, number of
pages, category, year of publication and number of books in stock. The application also maintains
some information regarding the author(s) (first name, last name) and the publisher (name and
location) to allow the salesman to query books, authors and/or publishers.
The application allows the user to:
- Read the list of the books
- Insert a new book
- Remove a book
- Update the quantity of a book
- Increase/decrease the quantity of a book
- Read the list of the authors
- Insert an author
- Delete an author
- Read the list of the publishers
- Insert a publisher;
- Delete a publisher;
Some constraints: if an author, or a publisher, will be removed from the DB, also all the books
associated with it must be removed. If a book is selected (viewed) also the publisher and the
author(s) must be shown.
Solution of Exercise 2 – Message System:
The application is a messaging system where registered users can create an account, exchange
text messages and make groups.
A registered user can initiate a chat with another user. Moreover he/she can create a new group
chat, acting as the group administrator, and send messages to the chats he/she belongs to, as well
as receiving messages from those chats. He can also leave a group.A group administrator can add
and remove new users to the group. He cannot assign his/her privilege to another user in the
group. A group administrator cannot leave the chat group that he/she created but can delete it.
Every time a user views a chat, all the latest messages from the chat are fetched from the server
and shown to him/her.
An anonymous user must be able to:
- Register in order to become a registered user
A registered user must be able to:
- Retrieve the list of chats he/she is a member of
- Read the messages of a chat
- Send a message to a chat
- Create a new private chat
- Create a new group chat
- Leave a group chat
A group admin must be able to:
- Add members to the group chat
- Remove members from the group chat
- Delete the group chat
The time-based event must be able to update the system on regular intervals to show
chat updates, i.e. new chats or new messages from the current chats, if any.
Login is a pre-requisite, there’s no need to have an include to login for every action. The admin
user can be a specialization of the user, so he can do all things a user does.
We have the extension on view chat so that the timer event can update the chat.
* is for the chat admin.

We don’t have the relation between message and user but at the end they added the senderID in
message.
Key-value databases insights
We have different naming convention for keys:
- Use meaningful and unambiguous naming components, such as 'cust' for customer, and
not for example cc, or 'inv' for inventory.
- Use range-based components when you would like to retrieve ranges of values. Ranges
include dates or integer counters.
- Use a common delimiter when appending components to make a key(e.g. the ‘:’
delimiter ).
- Keep keys as short as possible without sacrificing the other characteristics mentioned in
this list. The higher the lenght of a key, the higher is the occupation of memory.
- Well-designed key pattern helps minimize the amount of code a developer needs to write
to create functions that access and set values.
cust:1234123:firstName: “Pietro” custr:1234123:lastName: “Ducange”
We need to take care in the code that we should use in the firstName cust and in the
lastName custr, which is a problem.
- Using generalized set and get functions helps improve the readability of code and reduces
the repeated use of low-level operations, such as concatenating strings and looking up
values. In this way we can avoid to write again pieces of code.

The first thing to do when having to do with key/value databases is defining these
functions.
AppNameSpace is the name of the namespace holding keys and values for this application.
If we want to use an incremental id, we should use a variable inside.
- Consider to have a naming convention for namespaces.
- In production applications, we should include appropriate error checking and handling. It’s
really important to write test code.

We can deal with ranges of values.


Query: retrieve all customers who made a purchase on a particular date. We can define keys
associated with the customers who purchased products on a specific date as in the following
example:
cust061514:1:custId
cust061514:2:custId
cust061514:3:custId
cust061514:4:custId
We exploit the key/value database for quickly retrieve customers who purchased on that date.
This type of key is useful for querying ranges of keys because you can easily write a function to
retrieve a range of values.
This also avoid making joins between tables.
The following function retrieves a list of customerIDs who made purchases on a particular date:

If the code is too complicated, there must be something wrong with the choice.
An important choice regards whether to use simple or complex values.
Consider the following function which retrieves both the name and the address:

The function makes 6 access to the database (on the disk) using the getCustAttr function.
To speedup the function execution, we should reduce the number of times the developer has to
call getCustAttr or caching data in memory.
We may store commonly used attribute values together as follows:

We use an array of strings.


Key-value databases usually store the entire list together in a data block, thus just one block in the
disk will be accessed, rather than six.
Pay attention with too complex data structure for values.
We have a JSON file with customer’s information and all orders.

This entire structure can be stored under an order key, such as 'ordID:781379’.
We are forcing a schema, but the advantage of using a structure such as this is that much of the
information about orders is available with a single key lookup.
But as the structure grows in size, the time required to read and write the data can increase,
because the value, will be store in more than one memory block.
In general, if we need to use complex structure for the DB of our application, it is better to move
towards different architectures, such as Document Databases.
If we use too complex values it can be affected by the latency to retrieve the object.

Limitation of Key-Value DB
- The only way to look up values is by key.
Some DBMSs for key-value DB offer APIs that support common search features, such as
wildcard searches, proximity searches, range searches, and Boolean operators. Search
functions return a set of keys that have associated values that satisfy the search criteria.
- Some key-value databases do not support range queries.
Some DBMSs (ordered KV databases) keeps a sorted structure that allows for range queries
and/or support secondary indexes and some text search.
- There is no standard query language comparable to SQL for relational databases.
Case Study: K-V DBs for Mobile App Configuration
We consider a Mobile Application used for tracking customer shipments.
The following configuration information about each customer are stored in a centralized
database:
- Customer name and account number
- Default currency for pricing information
- Shipment attributes to appear in the summary dashboard, for example where it started,
where it is now
- Alerts and notification preferences
- User interface options, such as preferred color scheme and font
Most of the operations on the DB will be read operations.
The most suitable solutions are key-value database, even because it is a simple application.
We must define a naming Convention for keys. We choose:
entity type:account number.
Identified entities:
- Customer information, abbreviated ‘cust’
- Dashboard configuration options, abbreviated ‘dshb’
- Alerts and notification specifications, abbreviated ‘alrt’
- User interface configurations, abbreviated ‘ui’
We use complex values, putting things together because we have to read them always together.
We use just one namespace.

The value is a JSON file, and we store them together because this information we will always be
used together.

The options that can be selected are:


Among these we can select 6 values that can be listed there.

We specify that for this user alerts when we have a pick-up and delay event must be sent to these
contacts.

*alrt is a mistake, we should have UI.


This is how a key/value database can be used for configuration.

LevelDB
If data organization and management Is more important than performances, classical relational
databases are more suitable rather than key-value databases.
However, if we are more interested to performances (high availability, short response time, etc..)
and/or the data model is not too much complicated (no hierarchical organization, limited number
of relationships) we may use key/value databases.
Indeed, key/value stores are simple and easy to handle, data can be modeled in a less complicate
manner than in RDBMS.
If I want some features that I find in a cache maybe a key/value database is an optimal choice.

Let’s consider this following data structure:


And let’s design a key-value store starting from RDBMS.
In a relational model we can define the following tables.
The one-to-many relationship is handled by using a foreign key.

We want to translate the data modelling from a relational model to a key/value model.
We need to remember that keys embed information regarding Entity Name, Entity Identifier and
Entity Attributes. The key must be unique, and it also must be easy to retrieve a key/value pair.

Thus, we can translate the Employees table as follows.


The first information should be which table, then which employee which is identified by an id and
then the attribute.
While for the payment table:

At the end of the translation process, data will be organized in a unique bucket:

If we want to select employees in a certain address, we would do:

Usually, key-value DBMSs allow few types of query, such as:


- To retrieve a value by key
- To set a value by key
- To delete values by key

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping
from string keys to string values.
Main features are:
- Keys and values are arbitrary byte arrays
- Data is stored sorted by key (lexicographic style)
- Callers can provide a custom comparison function to override the sort order
- The basic operations are put(key,value), get(key) and delete(key)
- Multiple changes can be made in one atomic batch
- Users can create a transient snapshot to get a consistent view of data
- Forward and backward operation are supported over the data
To import LevelDB on your project you must insert just one of these dependencies:
1. Port of LevelDB’s C++ version

2. JNI interface to C++ version

To use the library, we need to import classes:

To open/close a DB connection:

Keeping a DB connection open consume resources.


The store is nothing but a file in the end.
The three main operations are:
The value are stored as byte arrays, so we need to do this conversion.

We can also destroy the database if it is already present, just to be sure that the file will be empty.

We can also do overloading by the number and type of parameters.


If we don’t want to have exceptions:

LevelDB advanced features are:


- Write batch, which perform multiple put/delete operations as an atomic operation (either
all of them or none)
We apply all operations to the batch and when we want to commit it, we will do all the
batch in the db.
Does the order matter?

The second version will delete a pair we inserted, instead the previous one will delete a key
pair that do not exists. The order matter.
- Iterators, which allow to iterate through the DB without knowing the keys of the
corresponding values

peekNext() will not update the pointer.


- Snapshots, which are consistent read-only views while DB is mutating from other instances.
Maybe because you want to freeze this information.

The snapshot is not a complete copy of the db, it gets resources to do the following
operations.
If we do not destroy the snapshot, differences will get bigger, and we will consume a lot of
resources.
If we know the first part of the key and want to get all pairs that start with that value:

An HashMap has the key with the first data type and the value with the second data type, it stores
key/value pairs.
We will get them in a lexicographic order.

The last one is a dictionary, the union of the first two information, the string conversion of an
hashmap.

We start from where it first finds the prefix, and we exploit the lexicographic order.
The order is lexicographic but the values 10,11 create problems because they will be stored in this
order: 10, 11, 1, 2 3. The ASCII characters are used.

If we want to avoid this, we can use another version of opening the DB connection:
We specify a new comparator, which is an interface provided by Java.

If i == 1 we are comparing numbers. If we find a difference, comparison != 0, we will get out from
the loop.

Document Databases
Document databases are non-relational databases that store data as structured documents,
usually in XML or JSON formats.
They ensure a high flexibility level and allow also to handle complex data structures (despite key-
value databases).
Document databases are schema-less.
One record is usually stored in one document, so they recall the organization of data in OOP
language.
They allow complex operations, such as queries and filtering. Some document databases allows
also ACID transactions.

XML Documents
XML stands for eXtensible Markup Language.
A markup language specifies the structure and content of a document.
Tags are added to the document to provide the extra information.
XML tags give a reader some idea what some of the data means.
XML can represent almost any form of information.

XML and Cascading Style Sheets (CSS) allowed second-generation websites to separate data and
format.
XML is also the basis for many data interchange protocols and was a foundation for web service
specifications such as SOAP (Simple Object Access Protocol).
XML is a standard format for many document types, including word processing documents and
spreadsheets (docx, xlsx and pptx formats are based on XML).
Advantages:
 XML is text(Unicode)based. Can be transmitted efficiently.
 One XML document can be displayed differently in different media and software platforms.
 XML documents can be modularized. Parts can be reused.

The XML ecosystem is composed of:


- XPath: useful to navigate through elements and attributes in an XML document.
- XQuery: is the language for querying XML data and is built on XPath expressions.
- XML schema: A special type of XML document that describes the elements that may be
present in a specified class of XML documents.
- XSLT (Extensible Stylesheet Language Transformations): A language for transforming XML
documents into alternative formats, including non-XML formats such as HTML.
- DOM (Document Object Model): a platform and language-neutral interface for dynamically
managing the content, structure and style of documents such as XML and XHTML. A
document is handled as tree.

XML databases are platforms that implement the various XML standards such as XQuery and XSLT,
They provide services for the storage, indexing, security, and concurrent access of XML files.
XML databases did not represent an alternative for RDBMSs.
On the other hand, some RDBMSs introduced XML , allowing the storage of XML documents
within A BLOB (binary large object) columns.

We may have a storage layer in which we put the data, where we can find binary XML, the DOM,
indexes for a better navigation in the XML files.
We access the data using a processing layer where we have engines and a module for loading
documents. We can then have a client layer for interacting with the DB.
The drawbacks of XML are:
- XML tags are verbose and repetitious, thus the amount of storage required increases.
- XML documents are wasteful of space and are also computationally expensive to parse.
- In general, XML databases are used as content-management systems: collections of text
files (such as academic papers and business documents) are organized and maintained in
XML format.
- On the other hand, JSON-based document databases are more suitable to support web-
based operational workloads, such as storing and modifying dynamic contents.

JSON Documents
JSON is the acronym of JavaScript Object Notation.
It is used to format data.
Thanks to its integration with JavaScript, a JSON document has been often preferred to an XML
document for data interchanging on the Internet.
Despite the name, JSON is a (mostly) language-independent way of specifying objects as name-
value pairs

An object, document, is an unordered set of name/value pairs


– The pairs are enclosed within braces, { }
– There is a colon between the name and the value
– Pairs are separated by commas
Example: { "name": "html", "years": 5 }
An array is an ordered collection of values
– The values are enclosed within brackets, [ ]
– Values are separated by commas
Example: [ "html", ”xml", "css" ]
Example of nested objects:

We have a collection of documents, the value of pairs are other documents.


A value can be a string, a number, true, false, null, an object, or an array
– Values can be nested
• Strings are enclosed in double quotes, and can contain the usual assortment of escaped
characters
• Numbers have the usual C/C++/Java syntax, including exponential (E) notation
– All numbers are decimal--no octal or hexadecimal

Example of nested arrays:


Document can include vectors that can include other objects.

Similarities:
– Both are human readable
– Both have very simple syntax
– Both are hierarchical
– Both are language independent
– Both supported in APIs of many programming languages
Differences:
– Syntax is different
– JSON is less verbose
– JSON includes arrays, in XML we have to put the tags
– Names in JSON must not be JavaScript reserved words

In XML for each user we have to repeat opening and closing tags.
Main features of JSON databases:
 Data is stored in JSON format.
 A document is the basic unit of storage. It includes one or more more key-value pairs and
may also contain nested documents and arrays.
 Arrays may also contain documents allowing for a complex hierarchical structure.
 A collection or data bucket is a set of documents sharing some common purpose. We can
collect items with different structures with just some common fields.
 Schema less: predefined document elements must not be defined.
 Polymorphic Scheme: the documents in a collection may be different.
A schema is a specification that describes the structure of an object, such as a table.
Data modelers have to define tables in a relational database before developers can execute code
to add, remove, or update rows in the table. Altering the schema after is not a good practice.
Document databases do not require this formal definition step.
Developers can create collections and documents in collections by simply inserting them into the
database.
Schema-less Pros: High flexibility in handling the structure of the objects to store

Schema-less Cons: the DBMS may be not allowed to enforce rules based on the structure of the
data.

A document database could theoretically implement a third normal form schema.


Tables, as in relational databases, may be “simulated” considering collections with JSON
documents with an identical pre-defined structure.

We can use mongoDB a relational database. If we have a many-many relation, in a RDB we would
need a solution like this. In a RDBMS all checks will be automatically done.
We can implement this in a document database. We have documents describing fields, and we
have an array of documents specifying actors.
Document databases usually adopts a reduced number of collections for modeling data.
Nested documents are used for representing relationships among the different entities.
Document databases do not generally provide join operations. In this approach we do not need
to do them if we want to retrieve actors who made films.
We will have redundancy, with information about actors, and in changing the description of one
actor we need to change it in all documents of fields, and we also have to do different checks, we
have to scan more information.
Programmers like to have the JSON structure map closely to the object structure of their code!

This approach is called Document Embedding.


The solution above allows the user to retrieve a film and all its actors in a single operation.
However, “actors” result to be duplicated across multiple documents.
In a complex design this could lead to issues and possibly inconsistencies if any of the “actor”
attributes need to be changed. Moreover, some JSON databases have some limitations of the
maximum dimension of a single document.
When we must choose if embed a document in another, we have to think about our application.
e.g. we show always comments with a post in a social network application
Another approach is Document Linking.
In the solution above, an array of actor IDs has been embedded into the film document.
The IDs can be used to retrieve the documents of the actors (in on other collection) who appear in
a film.
e.g. useful if we browse a list of films and if we select the film we may browse all actors if needed
We are rolling back to a relational model at a certain level. Now, at least two collections of
documents must be defined.
We are rolling back to the third normal form if we use three collections which simulate the three
pairs.

If we add another collection that puts together the information about films and authors. This is
not the right thing to do.

Document linking approaches are usually somewhat an unnatural style for a document database.
However, for some workloads it may provide the best balance between performance and
maintainability.
When modeling data for document databases, there is no equivalent of third normal form that
defines a “correct” model.
In this context, the nature of the queries to be executed drives the approach to model data.
We collect the need of the client, we identify use cases and then think about how organizing the
database. From requirements we will think about queries in our database.

Let’s see how to model a one-to-many relationship.


Let consider the customer entity which may have associated a list of address entities.
The basic pattern is that the one entity in a one-to-many relation is the primary document, and
the many entities are represented as an array of embedded documents.
If we need to retrieve such information together, we may think to use it, otherwise we may think
to use document linking.

Let’s see how to model a many to many relationship.


Let consider an example of application in which:
 A student can be enrolled in many courses
 A course can have many students enrolled to it
We can model this situation considering the following two collections:

We exploit a document linking approach, for each course we specify the enrolled students with
the ID of them. Same approach for students.
We have to take care when updating data in this kind of relationship, about the integrity of
references. We don’t know it that ID represents a student or a course.
Indeed, the DBMS will not control the referential integrity as in relational DBMSs.
We still don’t have a fixed schema.
Depending on the queries we have to do, it could be preferable to use document embedding.

Sometimes we have objects organized in hierarchies.


Parent reference solution:
This solution is useful if we have frequently to show a specific instance of an object and then show
the more general type of that category.

We specify in the same document the ID of its parent, and after its info, and the same for its
parent until arriving to the root.

Child reference solution:


This solution is useful if we have frequently to retrieve the children (or sub parts) of a specific
instance of an object.

A possible application is a GUI in which we can select one object considering the categories.

List of ancestor solutions:


This solution allows to retrieve the full path of the ancestors with one read operation.
A change to the hierarchy may require many write operations, we must re-organize all references,
depending on the level at which the change occurred.

The solution to use depends on the use of our data.


Scenario: Trucks in a company fleet have to transmit location, fuel consumption and other metrics
every three minutes to a fleet management database (one-to-many relationship between a truck
and the transmitted details)
We may consider generating a new document to add to the DB for each data transmission.
Let consider a document as follows:

We have repeated information in each new document in this way. truck_id, deriver_name and
date are stored and transmitted each time.
At the end of the day, the DB will include 200 new documents for each truck (we consider 20
transmissions per hour, 10 working hours).

An alternative solution may be to use embedded documents as follows:

We organize a document for each day, with an header and a field with a vector in which we specify
time and fuel_consumption. We reduce the amount of information to transmit and store.
We also have a better organization of data for making analytics.
Pay attention: we can have a potential performance problem!
When a document is created, the DBMS allocates a certain amount of spaces for the document.
It exects that the document will be modified and so it allocates more space than needed at the
beginning.

If the document grows more than the allocated space, the DBMS has to relocate it to another
location.
Moreover, the DBMS must free the previously allocated space.
The previous steps can adversely affect the system performance.
A solution for avoiding to move oversized document is to allocate sufficient space at the moment
in which the document is created with default values.
We can consider the maximum dimension of the document. If we realize later that the allocated
memory is bigger we can use splitting. We can have more documents, one for the morning, one
for the evening and one for the night.
Regarding the previous problem, the following solution may be adopted:
In conclusion, we have to consider the life cycle of a document and planning, if possible, the
strategies for handling its growing.

Indexing document database


In order to avoid the entire scan of the overall database, DBMSs for document databases (for
example MongoDB) allow the definition of indexes.
For example, retrieve all students in a database where there’s no order by age, we will have to
scan all the db.
Indexes, like in book indexes, are a structured set of information that maps from one attribute to
related information.
In general, indexes are special data structures that store a small portion of the collection’s data
set in an easy to traverse form.
The index stores the value of a specific field or set of fields, ordered by the value of the field.
The ordering of the index entries supports efficient equality matches and range-based query
operations.

The document is not ordered by store but we may need to retrieve students with a certain range
of scores. We could define an index building a data structure which is a data structure with a
collection of pointers towards documents with that specific score.

In the figure we show the classical scheme of a read-heavy application (business intelligence and
analytics applications) where indexes are really useful.
In this kind of applications, the use of several indexes allows the user to quickly access to the
database. For example, indexes can be defined for easily retrieve documents describing objects
related to a specific geographic region or to a specific type.

The example of the truck information transmission (each three minutes) is a typical write-heavy
application.
The higher the number of indexes adopted the higher the amount of time required for closing a
write operation. Indeed, all the indexes must be updated (and created at the beginning).
Reducing the number of indexes, allow us to obtain systems with fast write operation responses.
On the other hand, we have to accept to deal with slow read operations.
In conclusion, the number and the type of indexes to adopt must be identified as a trade-off
solution.

A more complicated system is a transaction processing system.


These systems are designed for fast write operation and targeted reads, as shown in the figure
below:

We may think to have two databases. A first DB, maybe a relational one, tuned for writes and
another where we maintain information of analytics.
Usually in this last db information are not extracted in real time.
In the first we may have rough data and this can be extracted, transformed, and loaded to the db
devoted for analytics. These operations can be done at night too.
There’s no need to think about consistency in the right.
We have different types of data regarding the same process.
Maybe we could use a document db in the left and a graph db in the right.

Vertical partitioning is a technique for improving (relational) database performance by separating


columns of a relational table into multiple separate table, useful when some data are not needed
to be retrieved together.
E.g. we want the list of all images’ information without the images themselves.

In this way, this information is not stored in the same portion of the disk.

Sharding or horizontal partitioning is the process of dividing data into blocks or chunks.
Each block, labeled as shard, is deployed on a specific node (server) of a cluster.
Each node can contain only one shard but in case of data replication, a shard can be hosted by
more than one node. It ensures a higher availability and a lower latency.

Advantages of sharding:
 Allows handling heavy loads (load balancing) and managing peaks of increase of system
users.
 Data may be easily distributed on a variable number of servers that may be added or
removed by request.
 Cheaper than vertical scaling (adding ram and disks, upgrading CPUs to a single server).
 Combined with replications, ensures a high availability of the system and fast responses.

To implement sharding, document database designers have to select a shard key and a
partitioning method.
A shard key is one or more fields, that exist in all documents in a collection, that is used to separate
documents.
Examples of shard keys may be: Unique document ID, Name, Date, such as creation date, Category
or type, Geographical region.
Any atomic field in a document may be chosen as a shard key.
We need servers to store data and a server to manage sharding. A possible sharding technique
could be geographical.
There are three main categories of partition algorithms, based on:
 Range: for example, if all documents in a collection had a creation date field, it could be
used to partition documents into monthly shards.
 Hashing: a hash function can be used to determine where to place a document. For
example, hashing of the key. Consistent hashing may be also used. If we make queries of
averages, it should be avoided for example, otherwise we need to access to different
servers.
 List: for example, let imagine a product database with several types (electronics,
appliances, household goods, books, and clothes). These product types could be used as a
shard key to allocate documents across five different servers. For example, all data about
electronics can be placed in a server.

If we make analytics, we may look for data from different servers. We could lose in terms of
performance.
Which one to use depends on the requirements, the collection, the queries.
Collections are sets of documents.
A collection can store documents of different types (no need of a specific structure/scheme for a
document).
In general, collections should store documents about the same type of entity.
What is the «type» of entity?

Are these two entities, or are they two instances of the same entity?
It depends on how we are going to use this data.
Let’s consider them as instances of the same entity.
Entity Name: System Event
We need to add a field, doc_type, to discriminate clickstream from server logs.

If you were to store these two document types in a single collection, you would likely need to add
a type indicator so your code could easily distinguish between a web clickstream document and a
server log event document.
If we do something like this data will be stored in the same block of the disk. Mixing document
types in the same collection can lead to multiple document types in a disk data block.
This can lead to inefficiencies whenever data is read from disk but not used by the application that
filters documents based on type. If we use this information separately. this is inefficient.
Filtering collections is often slower than working directly with multiple collections, each of which
contains a single document type.
If you are using disk storage, you will likely retrieve blocks of data that contain both clickstream
and server log documents. This will adversely impact performance.
If we make analysis on timestamp for example, this is not a problem but if for example, we make
queries to find clickstreams this is not a good solution.

We can use indexing but in general we have to think at how to use data.
If the decide to use indexes, recall that they reference a data block that contains both clickstream
and server log data, the disk will read both types of records even though one will be filtered out in
your application.
We don’t have to scan all of them, we can use the index and retrieve one specific data type.
In the case of large collection and large number of distinct “types” of documents, it may be faster
to scan the full document collection rather than use an index. Also, we have to update indexes
after each update.
The tip is to store different types of documents in the same collection only if they will be
frequently used together in the application.

In general, the application code written for manipulating a collection should have:
1)  A substantial amount of code that apply to all documents
2)  Some amount of code that accommodates specialized fields in some documents.
The case of High-Level Branching like in the picture, can
indicate a need to create separate collections.
Branching at lower levels is common when some documents have optional attributes.
This example is of documents of different types used often together.
process_click_stream and process_server_log are functions specific. We use two different types of
code for elaborating data.
There’s no need in the first case to put together documents, but if we have a general part like the
lower-level branching, it is correct to put together information in the same collection.
The only difference is that on ebook we have a size field, that can be added on the description.

Let’s suppose that we have this type of data, with an abstract class Product and some
specialization.

How would we organize in collections in a document database?


Rather than start with data and try to figure out how to organize your collections, it can help to
start with queries to understand how your data will be used. This can help inform your decisions
about how to structure collections and documents.
Our application might to be able to answer the following queries:
• What is the average number of products bought by each customer?
• What is the range of number of products purchased by customers
• What are the top 20 most popular products by customer state?
• What is the average value of sales by customer state?
All the queries use data from all product types, and only the last query subtotals the number of
products sold by type. This is a good indication that all the products should be in a single
document collection
Another reason to favor a single document collection is that the client is growing and will likely
add new product types. If the number of product types grows into the tens or even hundreds, the
number of collections would become unwieldy.
There is no query about books, CDs, or smart appliance. We’re making analytics on customers, in
general. We will access all products together.
If we use separate entities, we will need 3 collections. If our query request more than one
collection maybe we’re making a mistake, we’re making join operations and moving towards
relational databases.

For a specific query we have to consider one only collection, two in some cases.
We have to consider the customer too then.
Notice that: If we separate the product into different collections, and the number of product types
grows the number of collections would become unwieldy.
If you find yourself writing separate code to process different document subtypes, you should
consider separating the types into different collections. Poor collection design can adversely affect
performance and slow your application. There are cases where it makes sense to group somewhat
dissimilar objects (for example, small kitchen appliances and books) if they are treated as similar
(for example, they are all products) in your application.

Database normalization is the process of organizing data into tables in such a way as to reduce the
potential for data anomalies. An anomaly is an inconsistency in the data.
Normalization helps avoid data anomalies, but it can cause performance problems.
With normalized data, we need join operations, which must be optimized for improving
performances. It also reduces the amount of redundant data in the database.
If we use denormalized data, we may introduce redundancies and cause anomalies. If we change
information without doing it on duplicates, we may have anomalies.
On the other hand, we may improve the performances of the queries because we reduce the
number of collections and avoid join operations.
Denormalization supports improving read operations when indexes are adopted.
Designing databases entails trade-offs. You could design a highly normalized database with no
redundant data but suffer poor performance. When that happens, many designers turn to
denormalization.
As the name implies, denormalization undoes normalization—specifically, it introduces redundant
data. You might wonder, why introduce redundant data? It can cause data anomalies.
It obviously requires more storage to keep redundant copies of data. The reason to risk data
anomalies and use additional storage is that denormalization can significantly improve
performance.
When data is denormalized, there is no need to read data from multiple tables and perform joins
on the data from the multiple collections. Instead, data is retrieved from a single collection or
document. This can be much faster than retrieving from multiple collections.
In general, we have to achieve a reduced number of collections, especially involved in queries.
How do document data modelers and application developers get better performance? They
minimize the need for joins. This process is known as denormalization. The basic idea is that data
models should store data that is used together in a single data structure, such as a table in a
relational database or a document in a document database.
For those using document databases, avoiding data anomalies is still important, but they are
willing to assume more responsibility to prevent them in return for scalability and flexibility.
For example, if there are redundant copies of customer addresses in the database, an application
developer could implement a customer address update function that updates all copies of an
address.
By denormalizing the design, you could create a collection of documents that would require only
one lookup operation.

A better choice could be:

Avoid overusing denormalization.


If your application requirements are such that storing related information in two or more
collections is an optimal design choice, then make that choice.
If there are N documents in collection1 and M documents in collection2, this statement would
execute N × M times. The execution time for such loops can grow quickly.
If you are dealing with collections this large, you will want to use indexes, filtering, and, in some
cases, sorting to optimize your join by reducing the number of overall operations performed
Normalization is a useful technique for reducing the chances of introducing data anomalies.

Exercise on Document DB
Put fields in UML.
No search film, directly browse films.
No find by parameter, but find film
No by country but set country.
We can choose to do not put film information directly on the film, doing it separately like this.
But if we do analytics, it doesn’t make sense to have it separated.
We can have analytic use cases directly connected to the user, if we make analytics on a single
entity, for example film, we could put them connected to display film as an extend.
*View top rated films (specify the subject of the action)
Get top users is not that right, view user is better.
They decided to use two collections.
Delete on admin is wrong and separate view top rated for country and for year.
They have an array of ratings in which they put the rate a user may generate. Exploiting it they
were able to make statistics on the top-rated films.
Top rated production = production with top rated films, in this way it is possible to do it fastly.
They put writer and authors together with movies, but they did not use an array of documents
describing actors or writers but just a string, not a good choice. We need all fields.
They must embed them but with a vector of documents. Every time we have a one-to-many
relationship, we need to use an array.
We also have the production and the country, but there could have been a better choice for this. It
is a denormalized version that could have been better.
For the user consider using security aspects for password, use an array of rates done to movies
with sub-information about movies.
They also realized an index for production.
It’s acceptable to have redundancy but not that much, queries with 2 collections involved can be
accepted.
It was better to have language inside without an entity.

Column Databases
We have two main activities in a system:
- OLTP - Online Transaction Processing activities: software architectures oriented towards
handling ACID transactions.
- OLAP - Online Analytics Processing activities: software architectures oriented towards
interactive and fast analysis of data. Typical of Business Intelligence Software.
We have two servers devoted in one side on OLTP activities and on the other side OLAP activities.
Usually, OLTP systems exploit normalized data and queries we can do is typically on “Who bought
something”.
On the other side we have a system devoted to analytics, there is no interest in fast queries,

Since the beginning of digital files, the data of each record were physically organized in rows,
which were located in one section of the disk.
OLTP processing is mainly oriented towards handling one record at a time processing, so there’s
no problem with this organization.
When the first relational databases were designed, the world was mainly experiencing the OLTP
era.
The record-oriented workflow handled by the first relational databases and the row-oriented
physical structure of early digital files provided good performance.
In the record-based processing era (up to the end of the 80s of the last century), CRUD operations
(Create, Read, Update, Delete) were the most time-critical ones.
Reporting programs were typically iterated through entire tables and were run in a background
batch mode. The time response to the queries was not a critical issue.
When Business Intelligence software started to spread, OLAP processing assumed a big relevance,
thus the time response to queries became a critical issue to appropriately handle.
In the last decades Data warehouses appeared.
Data warehouses: information systems in which relational databases are tasked with supporting
analytic and decision support applications.
The OLAP part was supported by another DBMS organized in an appropriate way based on
relational database, devoted to providing analytic and decision service.
We have a separation between the two. In the OLAP side we generate analytics that can be used
in the business field and produce transactions, that thanks to ETL processes can pass from a
normalized to a denormalized version. So, transactions are transformed and used data warehouse
to produce reporting.
We need a different organization in order to produce fast queries.
We get a sepration for OLTP and OLAP and on the OLAP we have the data warehouse.

A star schema was exploited in data warehouse.

It is a relational solution where central large fact tables are associated with numerous smaller
dimension tables.
In this example fact regards quantitative facts, quantity, price, and discount.
Then we need to know what customer, store, product, and time. These are dimensions.
Aggregate queries could execute quickly with this organization thanks to the separation.
If we want a statistic regarding the customer, we involve the facts and the customer dimension.
Star Schemes do not represent a fully normalized relational model of data (redundancies are often
accepted, information sometimes depends partly on the primary key).
We have higher redundancy in …
The main drawbacks of star schemas are:
- Data processing in data warehouses remained severely CPU and IO intensive.
- The data volumes grew up along with and user demands for interactive response times.
- In general, around mid 90s of the last century, the discontent with traditional data
warehousing performance increased.

The columnar storage


When processing data for making analytics, in most of cases, we are not interested in retrieving
all the information of each single records.
Indeed, we are interested, for example, in retrieving the values of one attribute of a set of records
(for making trend graphs, or calculating statistics).
When dealing with row-based storage of records, we have to access to all the records of the
considered set for retrieving just the values of one attribute.
If all the values of an attribute are grouped together on the disk (or in the block of a disk), the task
of retrieving the values of one attribute will be faster than a row-based storage of records.

If data are stored by rows, in order to retrieve the sum of salaries we must scan five blocks.
In the case of column store, a single block access is enough.
In general, queries that work across multiple rows are significantly accelerated in a columnar
database.

Columnar Compression
Data compression algorithms basically remove redundancy within data values.
The higher the redundancy, the higher the compression ratio.
To find redundancy, data compressions algorithms usually try to work on localized subsets of the
data.
If data are stored in a column database, very high compression ratio can be achieved with very low
computational overheads.
As an example, often in a columnar database data are stored in a sorted order. In this case, very
high compression ratios can be achieved simply by representing each column value as a “delta”
from the preceding column value.
Column databases perform poorly during single-row modifications. In level of rows we can find
problems.

The overhead needed for reading rows, can be partly reduced by caching and multicolumn
projections (storing multiple columns together on disk).
In general, handling a row means accessing to more than once blocks of the disk.

Problem: The basic column storage architecture is unable to cope with constant stream of row-
level modifications.
These modifications are needed whenever data warehouse have to provide real-time “up-to-the-
minute” information.
Solution: Delta store is an area of the database, resident in in-memory, uncompressed and
optimized for high-frequency data modifications. It is a row-oriented area in memory.
Data in the delta store is periodically merged with the main columnar-oriented store (often data
are highly compressed). If we can accept in a query only the information stored in the columnar
part we can use only this, otherwise queries may need to access both the delta store and the
column store in order to return complete and accurate results.
In this architecture we have the column store, we have bulk loads which are loads of group of data
using ETL process from a database devoted to collect all transactions.

We may have another kind of problem.


Complex queries often need to read combinations of column data. We may have queries involving
multiple columns. Nobody ensures us they are in close portions of the disk.
Solution: Some column databases adopt projections.
Projections: combinations of columns that are frequently accessed together.
Columns of specific projections are stored together on the disk.

If we want to make queries that make accesses to more than one portion of the disk.
The super-projection contains all columns with unsorted data.
In the first projection we have redundancy but support to have just one access to the disk.
This will ensure low latency.
Projections can be created manually or through tools that based on queries give us projections.
Hybrid Columnar Compression (Oracle Solution)
We may need to edit rows and also need support for analytics on attributes.
At high-level, rows of data are contained within compression units of about 1 MB.
Inside these blocks we have columnar organization within smaller 8KB blocks.

Key-Value and Document Databases: Some issues


NoSQL databases may help us to solve the problem of handling huge amount of data.
Key-value databases lack support for organizing many columns and keeping frequently used data
together.
Document databases may not have some of the useful features of relational databases, such as
SQL-like query language.

BigTable was introduced in 2006 by Google and was one of the first column databases.
Main Features:
 Column based with a dynamic control over columns
 Data values are indexed by row identifier, column name, and a time stamp
 Reads and writes of a row are atomic
 Rows are maintained in a sorted order
Rows are composed of several column families. Each family consists of a set of related columns.
 Each family consists of a set of related columns (frequently used together.)
 Column values within a column family are kept together on the disk, whereas values of the
same row belonging to different families may be stored far.
 Columns families must be defined a priori (like relational tables).
 Applications may add and remove columns inside families (like key-value pairs).

A data value is indexed by its:


 row identifier, which uniquely identify a row (similar role than a primary key in relational
databases).
 column name, which identifies a column (similar role than a key in key-value databases)
 time stamp, which allows the existence of multi versions of a column value can exist. When
a new value is written to a BigTable database, the old value is not overwritten. Instead, a
new value is added along with a time stamp. The time stamp allows applications to
determine the latest version of a column value.

We have to define a keyspace whenever we work with column DBs, which is the analogous of
key/value databases.
It is the top-level logical container. It holds column families, row keys, and related data structures.
Typically, there is one keyspace per application.
Row keys are one of the components used to uniquely identify values stored in a database.

Row keys are also used to partition and to order data, to create shards, distributed in the different
servers.
Algorithms for data partitioning and ordering depends on specific databases.
As an example, Cassandra adopts the partitioner, a specific object, which is used for sorting data.
Moreover, by default, the partitioner randomly distributes rows across nodes.

A column, along with a row key and version stamp, uniquely identifies values.
It has the same role of a key in key/value databases.
Values may be typed, such as in Cassandra, or not, such as in BigTable.
The time stamp or other version stamp is a way to order values of a column.
Columns that are frequently used together are grouped into the same column family.
Column families are analogous to a table in a relational database.
The entries regarding the address can be put together in a column family.
However:
- Rows in columns families are usually ordered and versioned (using timestamp)
- Columns in a family can be modified dynamically.
- Modifications just require making a reference to them from the client application
We do not occupy space for null values and we can make analytics in a fast way with this
approach.

The most recent column databases are designed for ensuring high availability and scalability,
along with different consistency levels.
Clusters and partitions are fundamental for the distributed implementation of the databases.
Thanks to partitions we can help the system to balance the load.
A partition is a logical subset of a database.
Partitions are usually used to store a set of data based on some attribute of the data in column
databases, they can be generated on the basis of:
 A range of values, such as the value of a row ID (all IDs in a range)
 A hash value, such as the hash value of a column name or the row ID or the columnar
family name
 A list of values, such as the names of states or provinces (e.g. Canada in a server and USA in
another)
 A composite of two or more of the above options
We do not have to access to different server to get information from a column family, this is
important to achieve.

There exists two kinds of Column DB Architectures:


- an architecture of HBase that is the Apache version of BigTable; it exploits the Hadoop
environment. It has a master/slave organization, and the slaves are called Region Servers,
which are in charge to manage user requests and to specify specific functions to the
ecosystem. They may include a collection of regions, where a region contains all rows of
some specified keys. At the level of the region, we have data structures devoted to the
organization of the data (data node) and namespace (name node).
And they’re deployed on different servers. The zookeeper is in charge to maintain the
communication between the nodes.
We have different SPOF, for example the zookeeper can fail.
- A Cassandra peer-to-peer architecture which is masterless. Here each node can run the
same program. In this case we do not have unique POF, remaining nodes can help the
entire system to continue to work.
This is useful for horizontal scaling; we can add or remove nodes in function of needs.

Let’s see a collection of data structures and strategies that are used in Cassandra.
If we make a write operation on a server, we need to wait the time to store data on the disk, and
we’re not sure data are stored in adjacent positions. We have a latency to read and return the
control to the application. We can speed-up this problem with commit logs.

Commit Logs are append only files that always write data to the end of the file. We write one
record after the other.
These files allow column databases to avoid waiting too much for the definitive writes on
persistent supports.
Whenever a write operation is sent to the database, is stored into a commit log. Then, the data is
updated into the specific server and disk block. It ensures a high level of control of failures. If while
updating we have a failure we lose information.
In this case if we have any failure, we can exploit the commit log to recover the information.
After a failure, the recovery process reads the commit logs and writes entries to partitions.
It reduces latency in write operations and can be used in case of failures to recover information.
Bloom filters are probabilistic data structures.
Bloom filters help to reduce the number of reading operations, avoiding reading partitions that
certainly do not contain a specific searched data.
Given a row key, a column and timestamp we are not sure that making access to the disk we will
retrieve the data.
A Bloom filter tests whether or not an element is a member of a set: the answer of the test is “no”
or “maybe”. Thus, false positive may be obtained.
However, if the answer is “no” for sure the element, namely the searched data, is not present into
the specific partition. False negatives are not allowed, if the answer is “no” we are sure the data is
not present in the database. We limit the number of accesses in the disk.

If we use a member function, we have to store a lot of information, we need more memory but we
are sure if an element is present or not. In a bloom filter we can have some “no” sure answers,
and in case of “yes” we are not sure this is not a FP. For a and c we are sure, not for e.
Let’s see how to implement a bloom filter for a partition.
Let consider a specific partition. A bloom filter for this partition can be an array of n binary values.
We want to know if that specific shard contains or not data. Before accessing we want to know if
data is stored or not.
At the beginning, all the elements are set to 0, no data are stored inside.
Let consider also a set of k << n hash functions (n is the dimension of the array) which map each
data to be stored into the partition in an element of the array, into one position.
- When a new data is added (write) to the partition, all the hash functions are calculated and
the corresponding bit are set to one.
- When a read operation is required, all the hash functions are calculated and the access to
the partition is actually performed if and only if all the corresponding element in the array
are set to 1, if the k positions are all set to 1.
The maximum false positive rate depends on the values of n and k.
Let consider a Bloom Filter with n=10 and k=3.
The Insert function describes a write operation.
If we want to insert x we calculate k hash values and set them to 1.
The Lookup function describes a read operation. If we have an index set to 0 we will be sure that
y is not present in the partition.
If we would have found all 1s we would have returned maybe. Maybe another value activated this
bit.
This filter cannot be flushed, if we remove one data we cannot put to 0 the corresponding indexes
of the hash function, we do not know if any other data have activated the same index, or we
would allow FN.
Removing an element from the simple Bloom filter is impossible because false negatives are not
allowed.
We reduce latency when we perform read operations with this.
The complexity to access to the bloom filter is a constant, so it is to add or check.
Important property of Bloom filters is that the time it takes both to add an element and to check if
an element is in the set is constant.
The probability of a false positive decreases as n, the dimension of the bloom filter, increases,
while it increases as the number of elements stored in the Bloom filter decreases. The longer the
vector, the lower will be the probability to have FP but it increases with the number of elements
we are storing in the DB.
Suggestion: use n = O(h) and k=ln2(n/h) where h is the number of value to check against the filter.
If we know how many checks we will do (estimated) we can set them in this way.
If h = 10, we could choose n = 20,30,50.

Consistency level refers to the consistency between copies of data on different replicas.
We can consider different level of consistency:
 Strict consistency: all replicas must have the same data
 Low consistency: the data must be written in at least one replica
 Moderate consistency: all the intermediate solutions, for example a certain number of
replicas must be updated before returning the control to the application and then for
others we can have eventual consistency.
The consistency level depends on the specific application requirements. We must decide two
sides of the CAP triangle.

Consistency level is strictly related to the replication process.


Replication process regards where place replicas and how to keep them updated.
Usually, there is a pre-defined server for the first replica which is in charge of identifying the
position of the other replicas among the available servers.
The primary server is often determined by a hash function.
Each database has its own replication strategies.
For example, Cassandra uses clusters organized in logical rings and additional replicas are stored in
successive nodes in the rings in clockwise directions.

Anti-entropy is the process of detecting differences in replicas. It is important to detect and


resolve inconsistencies among replicas exchanging a low amount of data.
Checking hand by hand is heavy from a computational point of view and we don’t have to
exchange a lot of data. We resolve this sending hashes of data instead of the data itself.
Hash trees may be exchanged among servers in order to easily check differences. It is a binary tree
in which each leaf of the tree contains the hash value of a block of data. Each parent node
contains the hash value of the hashes its child nodes.

Transferring hash trees rather than the entire dataset allow us to transfer less data.
The first check will regard the roots, if the two roots have the same value we won’t have
differences.
If we have any differences we will explore the tree and check for differences for the roots of the
subtrees and so on, until finding the partition with differences. Once identified we can transfer
data and update the value.

One important aspect is communication protocols, that allow the node to communicate so to be
aware about the status of others.
In distributed systems, each node needs to be aware about the status of the other nodes.
A fundamental challenge in any distributed system is keeping member nodes up to date on
information about other members of the system. This is not too much of a problem when the
number of nodes in a cluster is small
The number of total messages to exchange for a complete communication is equal to n*(n– 1)/2,
where n is the number of nodes of a cluster.
While the number of nodes increase, we would spend all the time only for communication, the
fully connected approach is not acceptable.
We have three typical approaches:

(b) not appliable if n is big


(a) not appliable because we would have a centralized server and so a POF.
(c) each one will know all information without communicating directly. It will be slower but it may
reduce the number of connections to establish.

Let’s see how the architecture of Cassandra work for read/write operations. In the Cassandra
Read Protocol any node can handle a client request, the node who answer will depend on several
factors.
Any node knows where an information is located thanks to the gossip protocol, it acts as a proxy.
Any node in a Cassandra cluster can handle a client request. All nodes have information about the
state of the cluster and can act as a proxy for a client, forwarding the request to the appropriate
node in the cluster.

The Cassandra Write Protocol is the same. A write request can be answered by any node, and it
knows which is the node that should be updated, the primary server with the primary replica.
If it is available the information is updated in the node but if it is not the node maintain the
request, for example in a sort of commit log.

We have an hinted handoff process that checks periodically the availability of the server that
should be updated and update the server when it is available.
What happens if the node interested in a write operation is down?
Two scenarios may happen:
1) The operation will be lost; thus the target node will be in inconsistent state after its recover
2) A specific node can be designated to receive write operations and forward them to the target
node when it becomes available.
The hinted handoff process is in charge of:
1. Creating a data structure to store information about the write operation and where it
should ultimately be sent
2. Periodically checking the status of the target server
3. Sending the write operation when the target is available
Once the write data is successfully written to the target node (primary replica), it can be
considered a successful write for the purposes of consistency and replication.
The target node will be in charge to update other eventual replicas.

Consigli Key/Value Database


Don’t use public if you have get/set, put them private.
It’s a trade-off, if we expect for our key/value db to be big enough to store all buckets we can use
the same file to improve performance.
If static we will open only one connection, it could be if later we want to do multi-threading.
Don’t make everything static. If we need more connections don’t use static.
The driver itself will manage multi-threading.
In many-to-many relationships, you must store a list of the objects, for example.
If you have a string, you must parse it every time and so on.
We can you use redundancy to reduce the number of accesses in the DB. We use it to provide fast
queries to avoid join operations.

With this we are building again a relational database.


The idea is to insert one entry for each author or specify a structured valued including the list of
authors, we retrieve all authors each time we get a book. We have redundancy but we avoid join
operations.
If we need write operations, this key/value is not suitable.
Try one part that can exploit the key/value in the project. The most useful one is the read part.
It’s better to not store complex relations in key/value databases. The main usage is caching.
A usage is to store the path of images, to quickly retrieve them. The first time they download them
from the server and then they maintain them locally.
It could be useful to store books information in a denormalized way. The usage is to reduce as
much as possible join operations.
We need to have a list of books in authors and when we have to delete a book, we have to call a
deleteBook method that delete the book from the list of books in the Author object.

When I delete an author I have to delete all books, so I can keep only that list.
In the main we will retrieve a book and we can delete it.
In the deleteBook I have those general DB methods that depend on the particular database.
When I delete an author we scan the list and delete all books and then delete the author on the
DB.

Best Practice in developing Java Application


Package names should start in lower case, while classes should start with upper case.
We need to avoid conflicts with class/interface names.
Companies use their reverse domain name as package prefix.
Conflicts within a given company must be addressed with internal conventions.
Es. www.dii.unipi.it -> it.unipi.dii…
If the internet domain name is not valid for packaging, for example it is a java keyword, the
convention is to use underscores to substitute forbidden characters..
Examples
Libraries: [rev internet domain].[division].[lib name]
Application [rev internet domain].[solution].[app name]
Framework: [rev internet domain].[framework].[module name]
Application ca be structured by organizing packages:
- By features
it.unipi.dii.inginf.lsdb.library
.admin
.login
.registration
.user
.reserve
.search
- By layers and/or by type
it.unipi.dii.inginf.lsdb.library
.bean
.cache
.config
.persistence
.factory
.model
.service
.local
.remote
All items are related to a single feature into a single package.
High modularity, minimal coupling. After logging in the next will be in the user page, but we want
to minimize the interference we may introduce between packages.
If an entire feature must be deleted, we need to delete one single directory.
Easy to read and understand how features are implemented and relations among different
features becomes easier to identify.
Package should be exploited (visibility).
If a class should be used only on its package, it’s not useful to be visible outside of it.
Pay attention to package circular dependencies, can decrease the readability and maintainability.
Items separated by layers have advantages: items doing the same type of job are placed all
together.
It’s difficult to understand the main feature, we have low modularity and high package coupling.
It’s difficult to reduce the scope of objects and editing feature will require changes across many
packages.
It’s better to mix up the two technologies: each feature package is internally organized by
layers/types.

Column DB Tips
Queries should guide the design we must carry out to design our application.
Some examples of typical queries, most of them analytical, that may led to choose column DB are:
 How many new orders were placed in the Northwest region yesterday?
 When did a particular customer last place an order?
 What orders are en route to customers in London, England?
 What products in the Ohio warehouse have fewer than the stock keeping minimum
number of items?
These are queries that require some sort of statistics, histograms, analytics.

Queries provide information needed to effectively design column family databases.


The information include:
- Entities
- Attributes of entities
- Query criteria
- Derived values
Designers start with this information and then use the features of column-based databases
management systems to select the most appropriate implementation.
Entities: we need to recall that a single row describe the instance of a single entity in a columnar
DB. Rows are uniquely identified by row keys. A row cannot have specified attributes and they can
be added later.
Starting from all possible attributes we may have in mind; we can start which of them will be used
together or not.
Attributes of entities: attributes of entities are modeled using columns.
Query criteria: the selection criteria should be used to determine optimal ways to organize data
with tables and column families and how to build partitions
Derived values: it is an indication that additional attributes may be needed to store derived data.
For each user we can decide to implement a column family for all the orders, with a value
containing how many orders he did.
Differences with relational DBs:
 Column databases are implemented as sparse and multidimensional maps
It’s not a problem to have sparsity here, for example the state just for some of them.
 Columns can vary between rows, in relational DBs we need to define the schema
 Columns can be added dynamically
 Joins are not used because data is denormalized
 It is suggested to use separate keyspace for each application

Many-to-many relationships
Let’s see how to use column families in a non-conventional way with a pattern to implement
many-to-many relationships.

In a relational DB we use 3 tables, including a join table.


In order to avoid join operations, each customer includes a set of column names that correspond
to purchased products (Ids).
This can achieve a fast read operation. We need two tables, one for the customer and one for
products. A column family may include all attributes regarding the customer’s information. For
each row we will have or not these attributes specified.
We have another column family used as a list of the bought products.
Whenever a new product we will add a new column where the name is the id of the product.
We are storing a value in the name of a column, same as putting information in a key of key/value
databases. For another customer we will have total different columns, but they will be together in
the disk specifically for that column family.
The same kind of organization can be found in the product table. But it can be more useful the fact
that we can specify a value for each column we generated with the id of the product.
Because the column value is not used for anything else, we can store the product name there.
This can avoid the access to the product table, if we want to produce a report listing products
bought by a customer.

Keeping a copy of the product name in the customer table will increase the amount of storage
used, but we improve the read performance.
This is a form of denormalization, we introduce redundancy because this info will be present in the
tables regarding the products and similarly in the other table, we can do the same with columns
regarding the id of the user.
We can use it on the analytics side using an ETL starting from another database in the other part.

Model an entity with a single row


We may have an entity, a product, that can be specialized in more products. They have some
common attributes. With a columnar DB we may represent a specific instance like a row in the
product table of a columnar DB.
We will have just one table in which the role of the column family will be to simulate this.

We will have one or more general attributes of product in a column family and some columns of
the appliance in another column family.
This information will be stored close in the disk.
A single instance of one entity, such as a particular customer or a specific product, should have all
its attributes in a single row.
In this case, typical of column DBs, some rows store more column values than others.
The figure shows the classical organization in a relational DB of an entity that can assume different
“shapes”.
In order to ensure the atomicity of the operation in column DB, the same entity than in the figure
must be organized in just one table with 4 column families.
If we add a new product we need to redesign the table, this is the limitation.
One strategy is to define at the beginning another column family ‘other_products’ so that inside
we can build any kind of column, even if we don’t exploit features totally, we will mix together
information regarding different entities.

Avoid Hotspotting in row keys


Hotspotting occurs when many operations are performed on a small number of servers,.

One server is overloaded while the three remaining servers are underloaded.
If we read information in other systems and write information in our system we write in a
sequential way and the server interested for these operations will be always the same.
We can prevent hotspotting by hashing sequential values generated by other systems, hashing
the row key in order to distribute the load.
Alternatively, we could add a random string as a prefix of the row ID to the sequential value as a
key to access to one of the servers.
Another way to avoid it is to use parallel loading of the data; we have several threads in charge to
take data from another system and different threads will work in parallel on the different servers.
We won’t have hotspots even if we continue to use lexicographic order.
With one thread in the ETL process, we may risk an hotspot.
These strategies would eliminate the effects of the lexicographic order of the source file on the
data load process.

Column Value Versions


Some column DBMSs provides for column value versions. There are some frameworks that allow
us to specify the timestamp on which data has been introduced, and when we update we do not
overwrite but we generate another record for that column.
The number of versions maintained is controlled by database parameters (min and max number
of data versions). The versioning parameters depends on application requirements.
The higher they will be and the higher will be the amount of data we will need and time too,
because we need to scan among different versions of data.
Versioning is useful if we need to roll back changes we did to column values.

Avoid Complex Data Structures in Column Values


There are some DBMS that allow us to store as a value any kind of data structure, such a JSON file.

These information will be stored in a value, but if we do this we won’t exploit the features of the
database, of putting and separating attributes that should be retrieved together.
Using separate columns for each attribute makes it easier to apply database features to the
attributes, like indexing. If we want to index one attribute we need to have data quickly
retriavable, this can be implemented using a column. If we put a JSON file inside we cannot build
an index of one element of a JSON file.
Separating attributes into individual columns allows you to use different column families if
needed.
Index: a specific data structure in a DBMS that allows the database engine to retrieve data faster
than it otherwise would.
In column databases, we can look up a column value to quickly find rows that reference that
column value.
SELECT fname, lname FROM customers WHERE state = ‘OR’;
Primary indexes are indexes on the row keys of a table (created and maintained by the DBMS).
Keys are organized in order to quickly retrieve the row key.
Secondary indexes are indexes created on one or more column value.
In order to optimize the precedent query we need a inde for the state.
Either the database system or your application can create and manage secondary indexes.

Most of DBMS allow us to identify secondary indexes, but some DBMS do not have this feature
and this should be implemented In the size of application.
General and common sense rule:
“if we need secondary indexes on column values and the column family database system provides
automatically managed secondary indexes, then we should use them.”
The code is optimized in fact.
Example: we can speedup this query
SELECT fname, lname FROM customers WHERE state = ‘OR’ AND lname = ‘Smith’ By setting an
index on the state and on the lname attributes.
Typically, the DBMS will use the most selective index first. If we have 10.000 customers in Oregon
and 150 customers with the last name Smith.
The first secondary index will be the last name, we have only 150 records with Smith and we will
look from these the records from Oregon without scanning all 10.000 records. This is automatically
done by the DBMS.
The automatic use of secondary indexes has the advantage we do not have to change our code to
use the indexes if the application requirements will change.
Queries can be added/updated in fact.

In some situations it’s better to avoid to use automatically managed indexes.

In the first situation we have few values more or less equally distributed. An index will probably
not help much, especially if there are roughly equal numbers of each value. We will in fact have
almost the same number of scans, and we need to add and manage an additional data structure.
In the second situation we have many distinct values. We will have an entry in the index for each
distinct value and this is not suitable.
Indexes may not help much here because the index will have to maintain so much data it could
take too much time to search the index and retrieve the data.
In the third situation we have an attribute highly sparse, so we have too few values associated to a
specific column. It is not worth creating and managing an index data structure for this attribute.
We have problems in following this approach.

Create and Manage Secondary Indexes Using Tables


In this case we can also think about building our own index with our needs from the side of the
application. We can create additional tables to generate our ad hoc indexes.
Suppose we have two tables and another table in which we take some columns to manage the
index.

In this case, we must explicitly create and manage tables to store data we would like to access via
the index.
The programmer will be responsible for maintaining the indexes and two main strategies may be
adopted:
a) We may perform the write operation, update the table, thew index and then return the control
to the application. Updating an index table during write operations keeps data synchronized
(they’re always updated) but increases the time needed to complete a write operation. We have
an extra time for concluding the right operation.

b) We may perform batch updates in the indexes, which means we perform write operation, we
update the table and return the control. Then there will be a thread that will update the index
table. But this can introduce periods of time when the data is not synchronized, but this may be
acceptable in some cases.

In Column DBS the atomicity of write operations is guaranteed in the level of the column family.
We cannot use it with transactions, we cannot ensure atomicity with a group of rows. We are sure
we manage to conclude the update of column families, all, or nothing.

MongoDB
mongoDB stands for “Humongous DB”
 Open-source
 Document-based
 “High performance, high scalability”
 Different configurations on CAP triangle (CP and AP mainly)
The data model is document-based (max 16MB), they’re not of infinite size. If I think an array of
transactions will grow exponentialy, maybe it’s not a good idea to embed an array of transactions
insidde a user.
Documents are in BSON format, like JSON format but in a more efficient way, consisting of field-
value pairs. With documents we have flexibility, for example we don’t have to worry if an
information is missing or is additional. Each document stored in a collection.
Collections are like tables of relational DBs, their documents do not need to have uniform
structures. They have index set in common.
JSON stands for “JavaScript Object Notation”.
 Easy for humans to write/read, easy for computers to parse/generate
 JSON document is an unordered collection of “field: value” pairs
 6 main data types (string, number, object, array, boolean, null)
 Objects can be nested, we can have embedded objects
 JSON objects fields are not ordered, but array elements are, if we have objects inside an
array the order will be maintained

BSON stands for Binary JSON


• Binary-encoded serialization of JSON-like docs
• BSON extends the JSON model to provide additional data types, ordered fields, and to be
efficient for encoding and decoding within different languages (for example between BSON and
JSON)
• The MongoDB BSON implementation is lightweight, fast and highly traversable, means that if
we have a document with many nested objects it is efficient to traverse this hierarchy.
The following are the BSON types, and in orange we have the classical JSON types:
The advantages of using documents are:
 Documents (i.e. objects) correspond to native data types in many programming languages.
 Embedded documents and arrays reduce need for expensive joins.
 Dynamic schema supports fluent polymorphism.

By default, each document contains an _id field. This field has several special characteristics:
 The value serves as primary key for collection.
 The value is unique, immutable, and may be any non-array type.
 Default data type is ObjectId, which is “small, likely unique, fast to generate, and ordered.”
 Sorting on an ObjectId value is roughly equivalent to sorting on creation time.
It’s not necessary to specify it in the schema, automatically the MongoDB server will add this field.
There is a default index for a collection that is by id, we don’t have to specify if we want to sort by
id because it is done automatically. Also, it will be auto-incremented.
High Performance
- MongoDB provides high performance data persistence.
- Support for embedded data models reduces I/O activity on database system. It can be
efficient in reading, because information can be gathered inside the document.
- Indexes support faster queries and can include keys from embedded documents and
arrays. It’s not efficient to maintain a lot of indexes.
Rich Query Language
MongoDB supports a rich query language to support read and write operations (CRUD) as well as:
 Data Aggregation
 Text Search and Geospatial Queries.
High Availability
MongoDB’s replication facility, called replica set, provides:
 automatic failover
 data redundancy
Horizontal Scalability
We shard data in a collection.
MongoDB provides horizontal scalability as part of its core functionality:
 Sharding distributes data across a cluster of machines.
 MongoDB supports creating zones of data based on the shard key.
 In a balanced cluster, MongoDB directs reads and writes covered by a zone only to those
shards inside the zone.
We do not perform a query to a specific shard but in the lower layers of MongoDB, queries will be
routed according to some needs.
The core processes in MongoDB are:
 mongod: the main daemon process for the MongoDB system. It handles data requests,
manages data access (it will handle concurrency), and database process;
 mongos: the controller and query router interfaced between client applications and the
sharded cluster; Before going to the mongod process it will intercept the query and decide
where to send the query. It manages shards.
 mongo: the interactive MongoDB Shell. The MongoDB Shell is transitioning to mongosh
which is shipped with MongoDB Compass (graphical application);

The replica architecture is the following.

Primary and secondaries are physically inside different machines. We will have the entire DB
replicated to each node.
The writing operations will be intercepted by the primary node, while the reading operations can
be redirected to secondary nodes. After updated the primary, asynchronously the secondary
nodes will ask for updates.
There will be an algorithm to elect the primary node and if it goes down, a secondary node will be
elected as the new primary node.
The following is the sharding architecture.
mongos process will intercept the query and coordinated with other mongos processes will route
them, and config servers that will configure mongos processes in order to have the same view of
the sharding architecture.
We can have different mongos for different application servers.
This is the overall MongoDB architecture, with the two architectures.

The main application will ask the driver to perform the query. The driver will send the query and
the query router will intercept the query and find the correct shard. Within the shard it will be
redirected to a primary or secondary server if it will be a write or read operation.
Main differences between MongoDB and SQL are:

In MongoDB a schema definition is not required. We can put constraints to emulate something
similar to the schema definition.
Installation requires Community MongoDB and MongoDB Compass.
The default configuration file is, on macOS, in /usr/local/etc/mongod.conf.
After installing it, it will run automatically. We consider a localhost installation where 27017 is the
default port the standalone mongod listens on. Use --port for setting a prot number different than
the default one.
#Stop the running mongoDB server daemon
mongod --shutdown
#Start the mongoDB server daemon
mongod
#Start the mongoDB server daemon on port 28018
mongod --port 28018
We can have different server deamons on the same machine in this way.
Mongo ATLAS is useful for cloud services.
The mongo shell is an interactive JavaScript interface to MongoDB.
We can use the mongo shell to query and update data as well as perform administrative
operations.
To connect to a MongoDB instance running on localhost with a non-default port 28015:
mongo --port 28015
To connect to a MongoDB instance running on a remote server:
mongo --host mongodb0.example.com --port 28015
We can check the list of databases:
show databases (or show dbs)
We can display the database currently used:
db
We can switch databases (<database> is the name of a specific database):
use <database>
We can create both the database myNewDatabase and the collection myCollection during the
insertOne() operation (if they do not exists):
use myNewDatabase
db.myCollection.insertOne( { x: 1 } );
To look for all documents in a collection:
db.myCollection.find()
We can see the list of collections:
show collections
If we insert a document with two fields with the same name: {x:5, x:10}, we will have the last value
for that field. We cannot have in fact duplicate fields, the second field will overwrite the first one.
In MongoDB Compass we will first have the URI of the connection to connect to the cluster. We
can connect our DB to the ATLAS.
We can also create indexes for a field and specifying the order we want. In Explain Plan we can
evaluate the performance of our queries. This is a useful tool to design indexes.
In Aggregation we can sort by field ($sort {pop: -1}, descending by population) and then limit the
result on the first ($limit 1), and we will see the pipeline.
Graph Databases
Use graph databases only for network structures or something similar.
A graph is a collection of vertexes and edges.
A vertex represents an entity and for entity we consider an instance of an entity in this
environment.
In the figure, we show instances of an entity (city) represented by a set of vertices. We are in the
context of highways networks.

The properties that the vertices, namely the cities, may have are:
• Population
• Coordinates
• Name
• Geographic Region
The relationships are specified among edge and are among two instances of the same or different
entities.

The properties that the edge in the highways network, may have are:
- Distance
- Speed limit
- Number of lanes
A Weight is a commonly used property of edge.
It represents some value about the strength of the relationship (cost, distance, etc.)

Directed edges have a direction, whereas undirected edge does not have a direction.
The presence or not of a direction in edges depends on specific applications.
An undirected graph can be always specified in terms of two directed edges. If we have just the
same value for both, it’s better to use just one to reduce the amount of memory to use.
Example: Highways Network Graphs Example: Family Relation Graphs

A path through a graph is a set of vertices along with the edges between those vertices and allow
us to move from an entity to another.
Paths allow us to capture information about the relationships of vertices in a graph.
There may be multiple paths between two vertices. In this case (highways networks, for instance),
often the problem of finding the best (the shortest, the cheapest, etc.) path is taken into
consideration, if we have a weighted graph.
All elements In the path represents ancestors of one node.

A loop represents a relation that an entity has with itself.

Proteins, for example, may interacts with proteins of the same type.
The presence of loop may not have sense in some specific domains, such as family tree graphs.

The union of two graphs is the set of edges and vertexes put together from the two graphs.
A graph is not always connected, so if there’s no edge in common, we may only have vertexes.
The intersection of two graphs is the set of edges and vertexes in common of the two graphs.
Graph traversal is the process of visiting all vertices in a graph.
The purpose of this is usually to either set or read some property value in a graph, in edges or
vertexes.
The traversal of the graph is often carried out in a particular way, such us minimizing the cost of
the path.
Traversing all the graph may be expensive, if our query need traversal we need to analyze if this
traversal is needed or not and if using a graph DB is a good solution.
Two graphs are isomorphic if:
 for each vertex in the first graph, there is a corresponding vertex in the other graph
 for each edge between a pair of vertices in the first graph, there is a corresponding edge
between the corresponding vertices of the other graph.
We have the same vertexes and relations among the same but relations that may be different.

Detecting isomorphism in graphs or sub graphs allows us to identify useful patterns.


For example using two social networks we can see if we have isomorphism to understand if there
are people that work together and are also friends.
Order: the number of vertices
Size: the number of edges
These two measures are very important to evaluate timefor evaluating queries on the graph and
memory occupancy required to handle specific operations.
Example: to find the largest subset of people in a social network that know each other.
The higher the order and the size of the graph, the harder will be to solve the query!
Degree: is the number of edges linked to a vertex. It measures the importance of any given vertex
in a graph.
Example: Consider a person with many family members and friends that he/she sees regularly
(high degree vertex). What if that person contracts a contagious disease?
Closeness: is the measure of how far the vertex is from all others in the graph. It can be calculated,
in a fully connected graph, as the reciprocal of the sum of the length of the shortest paths
between the node and all other nodes in the graph.
It could tell how fast the disease can be transmitted.
It is useful to evaluate the speed of information spreading in a network, starting from a specific
node.
Betweenness of a vertex is a measure of how much a vertex represents a bottleneck in a graph.

In the example above, both vertices 1 and 2 will have high betweenness levels. Indeed, they form
a bottleneck in the graph and if one of them were removed, it would leave the graph
disconnected.

A flow network is a direct graph. It adopts capacitated edges.


It Is typically used in electronic systems.
Each vertex has a set of incoming and outcoming edges.
In each vertex the sum of the capacities of incoming edges cannot be greater than the sum of the
capacities of outcoming edges.
The unique exception to the previous rule regards the source and the sink nodes, where for
example we can have only outcoming or incoming edges.
In bipartite graphs there are two distinct sets of vertices, two types of entities with different
instances.
Each vertex is connected only with vertices of the other set.
It can be used for modeling relationships between students and teachers.

A multigraph is characterized by the presence of multiple edges between vertices.


In the example above, each edge may represent a different way (by plane, by truck, etc.) to
shipping item between the cities.
Each edge can be equipped with different sets of properties.

A graph is a collection of Vertices and Edges. It will be persistent on the disk and managed by a
DBMS.

Vertex: specifies an entity. Usually, entities in a graph belongs to the same category.
Edge: specifies a relation between two vertices. Relations may be long terms or short terms.
Vertices and edges may have properties.
A highway could be a single edge between two cities, in which case it models the road traffic in
both directions.
Alternatively, a graphical representation could use two edges, one to represent travel in each
direction (for example, east to west and west to east).
Issue: Which is the “right way” to model highways? Answer: It depends!
- If the goal is to model distance and approximate travel times between cities, then a single
edge might be sufficient.
- If we are interested in more detailed descriptions of highways, such as direction, number
of lanes, current construction areas, and locations of accidents, then using two edges is a
better option. We need to avoid to put too much information in the graph.

Entities are people and we have an edge for each meeting.


Regarding entities we can store the infection state and the probability to infect in edges, for
example regarding when I met them.
Analyzing paths of the entire graph we can calculate the probabilities, for example multiplying.
Part-of relations graph:
 The upper vertex is often called the parent vertex
 The lower vertices are called children vertices
 Parent vertices can have multiple children vertices.

The bipartite graph can model social medias’ interactions.

Graph databases adopt vertices and edges to store explicit relations between entities.
In relational databases, connections are not represented as links. Instead, two entities share a
common attribute value, which is known as a key.
Join operations are used in relational databases to find connections or links.
When dealing with huge amount of data (several tables with too much rows) join operation
became computationally expensive.
Example of query: list all the course a particular student is enrolled in.
In relational models, we need a join procedure. The edges between students and courses allow us
to quickly query the databases.
Instead of making joins, at low level we have pointers in the object that points the address of the
object describing the course.
If the entries grow too much the join operation is really expensive.

Graph databases allows to modeling many-to-many relations in a easier way than relational
databases.

Edges allow us to explicitly modelling many-to-many relations, rather than using tables.
The amount of entries would be huge in a social network.

Different type of relations between entities may be handled exploiting multiple type of edges.
Multiple type of edges can be simply obtained specifying different values of edge properties
When dealing with NoSQL databases, it is not easy to decide which solution should be adopted. In
general, data base designer should have in mind the main kinds of queries that the application will
perform on the data. Graph databases are suitable for problem domains easily described in terms
of entities and relations. Typical queries are:
How many hops (that is, edges) does it take to get from vertex A to vertex B?
How many edges between vertex A and vertex B have a cost that is less than 100?
How many edges are linked to vertex A? (degree)
What is the centrality measure of vertex B?
Is vertex C a bottleneck; that is, if vertex C is removed, which parts of the graph become
disconnected?
In the previous examples of queries, there is less emphasis on:
 selecting by particular properties (how many vertices have at least 10 edges?). To make
this query we need to traverse all the graph.
 aggregating values across a group of entities (selects all customer orders from the
Northeast placed in the last month and sum the total value of those orders)
Moreover, the previous queries are fairly abstract (computer science/math domain).
Designer needs to translate queries from the problem domain to the domain of experts working
on graphs.

Let’s suppose to design a DB for handling movies, actors and directors.


Let’s suppose to offer the following services for users:
1. Find all the actors that have ever appeared as a co-star in a movie starring a specific actor
(for example: Keanu Reeves).
2. Find all movies and actors directed by a specific director (for example Andy Wachowski).
3. Find all movies in which specific actor has starred in, together with all co-stars and
directors
They could be implemented using a relational database:

Problems:
 We need to make several join operations.
 SQL lacks the syntax to easily perform graph traversal, especially traversals where the
depth is unknown or unbounded.
 Performance degrades quickly as we traverse the graph. Each level of traversal adds
significantly response time to a specific query.

In the following, we show a SQL query for finding all actors who have ever worked with Keanu
Reeve.

We don’t know the dimension of the graph and also it can grow a lot.

We can exploit the Acted in Role property of the edge to specify the role. Exploiting the graph
structure we can build the db. At low level we will just traverse the graph exploiting pointers.
We will satisfy this query in this way:
If we need analytics query and this kind of query, we can use MongoDB and GraphDB and we will
have to work on the consistency of the DB. We will have a server for MongoDB and a server for
neo4j and the application will deal with it.

Social Network for NOSQL Database Developers


Let consider the database developer as the main actor of the social network.
The main use cases may be:
1. Join and leave the site
2. Follow the postings of other developers
3. Post questions for others with expertise in a particular area
4. Suggest new connections with other developers based on shared interests
5. Rank members according to their number of connections, posts, and answers (we can do it
using the degree and not traversing)
We identify entities and their properties:
Developers: Name, Location, NoSQL databases used, Years of experience, Areas of interest.
It’s not mandatory to put al them in a graph DB.
Posts: Date created, Topic keywords, Post type (for example, question, tip, news), Title and Body
of post.
It’s not right to have only the id in the node and access in other DBs for other information. We
have to put all information we need in the graph.
Considering the two entities in our domain, the following candidate relations, but we talk about
different relationships than relational DBs but relationships in the graph, can be analyzed:
 Developer - Developer
 Developer - Post
 Post - Developer
 Post - Post
Notice that, instead of considering all the possible combinations for the identification of candidate
relations, as the number of entities grows, it is better to focus on combinations that are
reasonably likely to support queries related to the application requirements.
If a developer follows another developer, it is reasonable that he/she is interested in reading
his/her posts and also on the posts of the friends of its friends. The depth of the path, namely how
many edges to traverse for identifying a followed developer, is a parameter that can be defined
and handled at the level of application code.
It’s also possible to exploit paths to suggest developers to follow.
We can also have an edge ‘is_friend’ between followed people to add other features, like letting
him know when someone’s birthday will be.
Here we have a Developer and a Post with two directed edges (a) or one edge not direct.

We can have different meanings on the two directions.


A post may be created in response to another post!

Remember that queries drive the design of graph databases.


Domain-specific queries must be translated into Graph-centric queries, after the identification of
entities and relations between entities.

If we have someone that connects two groups of developers, we have two disconnected groups
and posts of a group will not be seen by the other.
Basic steps for designing GraphDB:
- Identify the queries you would like to perform.
- Identify entities in the graph, among all entities we have.
- Identify relations between entities
- Map domain-specific queries to more abstract queries
- Implement graph queries and use graph algorithms. We need to exploit algorithms and in
particular the optimized form.
Some advice are:
1. Pay attention to the dimension of the graph: the bigger the graph the more computational
expensive will be performing queries.
2. If available, define and use indexes for improving retrieval time. We need to take care of it,
to keep it updated we may traverse all the graph.
3. The higher the number of edges, the higher the occupied memory. Use appropriate type of
edges:
a. In case of symmetrical relations, use undirected edge, otherwise use direct edges.
b. Properties requires memory! If possible, use simple edge without properties or few,
but do not exploit another DB just to store details. Also pay attention to the data
type of the properties.
4. Watch for cycles when traversing graphs: in order to avoid endless cycles, use appropriate
data structure for keeping track of visited nodes.

Pay attention if we need to start a traversal at vertex A and then


followed a path through vertices B, C, D, E, and F, you would end up
back at A.
If there is no indication that we have already visited A, we could find us
in an endless cycle of revisiting the same six nodes over and over again.
If we know that there will be cycles use this information in the code,
remembering what you’ve already visited.

Most of the famous implementation of graph databases do not consider the deployment on
cluster of servers (no data partitions nor replicas). Indeed, they are designed for running on a
single server and can be only vertically scaled.
Thus, the developer must take a special attention to the growing of:
 The number of nodes and edges
 The number of users
 The number and size of properties
Since last year the community version of Neo4j, it’s possible to have replicas of the graph.
Execution Times of Two Typical Algorithms:

Increasing the number of vertexes, the execution time explodes with the maxima clique, the
largest group of people who all follow each other.
Parent POM
A parent pom is a pom file that can be imported in the application.

We can have project inheritance or project aggregation.


In case of inheritance a pom is shared in the organization, but the super-pom doesn’t know who
will inherit from it, while in project aggregation the parent pom is aware about the modules that
are inheriting from it. The packaging must be specified if we want to implement the project
aggregation.
In modules we have the reference to artifactIds of modules and in the parent section of sons must
contain information about the parent pom.

We have an application composed of 4 different modules aggregated in the parent pom. We want
to provide the service for books and music, so we use two modules in common and the two
modules separated.
If one of the libraries part of the framework requires another library of the same framework, you
don’t import it manually.
We have the parent containing 4 modules, and we have that the login requires the cache library.
If we want to upgrade login utils version to the latest but also, we have an upgrade of the cache
module, if we’re forced to specify the version of cache and login util in the POM there’s a high
probability to update util without updating cache. If we use the multi-module function, we
upgrade the dependency of the parent and not of the single module, importing the coherent set.
If we have an upgrade of two modules, it will work correctly.

API & SPI

We don’t want to show outside the PriceGridServiceImpl but just the GridService interface.
Objects that we want to be accessible outside have public scope, otherwise package scope.
Removal is a problem because if it was public, appications using it will have compilation errors.
Rather than removing an API item it’s better to declare it deprecated, we keep the retro-
compatibility.

It is a set of items, for example an interface, a library is exposing but that’s not implemented in the
library because it’s up to the application to implement it based on the logic of the application.
The library could invoke the implementation of that.
We can modify the behavior of the library doing it.
If we add something mandatory, upgrading library version we need to add something to make it
work.
It’s important to separate API & SPI:

We can run a plugin in the pom to parse the source files and do this transformation.
If we want to distribute a library, it’s not a good idea.

We can specify to not obfuscate the API of the SPI.


Unit Test

Regression: bug
Test always edge cases.

Acceptance tests are made from the client.


Integration tests test the integration of components.
Unit tests are developed by Maven at building phase.
fixture: test class
Use GIVEN, WHEN, THEN convention.

We can fix a minimum coverage of the code and confirm the test only if it’s greater than this.
A common approach is:
@Test for test methods.
@BeforeEach every time before a @Test is executed.
@BeforeAll at the beginning.
Same for after.
We have to increase the visibility scope of the constructor removing private and flagging it with
that annotation.
The init method whenever one of these tests is executed before them creates a new instance of
the system under test.
We create a new object because we want to test in isolation.
Being the package name the same production code and test code can expoit the package visibility.
The one above is the original code.
After the refactory we have the class in the right:

IEngine contains only start.


MockEngine is test code. In test method we allocate instance of our mock, of instance under test.
In this way I’m able to test if I used the EngineStarter to start the engine, it is started.

They build at run-time the mock and we just specify the behavior we expect.

Whenever my mock is invoked in the method getDataById passing any possible string then return
me a new data allocated constructed in this way.
We have two fields, a mock and a service under test.
We initialize the mock and allocate an instance of the sut.
The first test is the one that the singleton pattern is working.
We check instances returned by the getInstance methods are the same.
CRUD operations on MongoDB
Create or insert operations add new documents to a collection.
If the collection does not currently exist, insert operations will create the collection.
In MongoDB, insert operations target a single collection.
Only aggregation will allow us to consider more than one collection.
All write operations in MongoDB are atomic on the level of a single document.
In MongoDB, each document stored in a collection requires a unique _id field that acts as a
primary key.
If an inserted document omits the _id field, the MongoDB driver automatically generates an
ObjectId for the _id field.
The insertOne() statement allow us to insert a document.

Db.NameCollection.insertOne().
The following example inserts a new document into the inventory collection.

If the document does not specify an _id field, MongoDB adds the _id field with an ObjectId value
to the new document.
We will receive the acknowledgment with the id associated to the object if everything worked.
insertMany() can insert multiple documents into a collection.
To this aim, an array of documents must be passed to the method.
The following example inserts three new documents into the inventory collection.
We have to specify a vector of documents.

By default, documents are inserted in order.


In the following, we show an example for querying all documents of a specific collection.

The query corresponds to the SELECT * FROM inventory in SQL language.


To specify equality conditions, use <field>:<value> expressions as parameters for the find()
function.

This operation corresponds to the following SQL statement:


SELECT * FROM inventory WHERE status = "D"

A query filter document can use the query operators to specify conditions in the following form:
{ <field1>: { <operator1>: <value1> }, ... }
The following example retrieves all documents from the inventory collection where status equals
either "A" or "D":

The operation corresponds to the following SQL statement:


SELECT * FROM inventory WHERE status in ("A", "D")

The following example retrieves all documents in the inventory collection where the status equals
"A" AND qty is less than ($lt) 30:

The operation corresponds to the following SQL statement:


SELECT * FROM inventory WHERE status = "A" AND qty < 30

The following example (last query in the figure) retrieves all documents in the collection where the
status equals "A" OR qty is less than ($lt) 30:

The operation corresponds to the following SQL statement:


SELECT * FROM inventory WHERE status = "A" OR qty < 30
In the following example, the compound query document selects all documents in the collection
where the status equals "A" and either qty is less than ($lt) 30 or item starts with the character p:

The operation corresponds to the following SQL statement:


SELECT * FROM inventory WHERE status = "A" AND (qty < 30 OR item LIKE "p%")
These are all query operators:

For replacing documents:


 db.collection.updateOne(<filter>, <update>, <options>)
 db.collection.updateMany(<filter>, <update>, <options>)
 db.collection.replaceOne(<filter>, <update>, <options)
We can specify in these:
 The <filter> document is defined as a query that we previously analyzed and specify the set
of document to update.
 The <update> document specify the modification to apply.
 The <options> document specify a set of parameters for the modifications.

To use the update operators, we need to pass to the update methods an update document as
follows:

It specifies the modification to apply.


Some update operators, such as $set, will create the field if the field does not exist.
This is the list of update operators:

The db.collection.updateOne() method on the inventory collection update the first document
where item equals "paper":

The $set operator updates the value of the size.uom field to "cm" and the value of the status field
to "P".
The $currentDate operator updates the value of the lastModified field to the current date. If
lastModified field does not exist, $currentDate will create the field.

The db.collection.updateMany() method on the inventory collection update all documents where
qty is less than 50:

To replace the entire content of a document except for the _id field (it is immutable), pass an
entirely new document as the second argument to db.collection.replaceOne().
When replacing a document, the replacement document must consist of only field/value pairs;
i.e. do not include update operators expressions.
The replacement document can have different fields from the original document.
The following example replaces the first document from the inventory collection where item:
"paper":
We will continue to have the item but also this new field and not the old fields.

To delete all documents from a collection, pass an empty filter document {} to the
db.collection.deleteMany() method.
The following example deletes all documents from the inventory collection:

We can specify criteria, or filters, that identify the documents to delete. The filters use the same
syntax as read operations.
To specify equality conditions, use <field>:<value> expressions in the query filter document:
{ <field1>: <value1>, ... }
A query filter document can use the query operators to specify conditions in the following form:
{ <field1>: { <operator1>: <value1> }, ... }
The following example removes all documents from the inventory collection where the status field
equals "A":

To delete at most a single document that matches a specified filter (even though multiple
documents may match the specified filter) use the db.collection.deleteOne() method.

We can use the mongoexport command-line tool to produce a JSON or CSV export of data stored
in a MongoDB instance. Run mongoexport from the system command line, not the mongo shell.
To export a specified collection to a specified output file from a local MongoDB instance running
on port 27017 use the following command:
mongoexport --collection=<collectionName> --db=<dbName>
--fields=<field1[,field2]> --out=<filename.json(-csv)> --type=<json/csv>
If we want to export from a specific host and port, use the following command
mongoexport --host= <hostAddress> --port=<portNumber>--collection=<collectionName>
--db=<dbName> --fields=<field1[,field2]> --out=<filename.json(-csv)> --type=<json/csv>

The mongoimport tool imports content from an Extended JSON, CSV, or TSV export created by
mongoexport, or potentially, another third-party export tool.
Run mongoimport from the system command line, not the mongo shell.
Starting in MongoDB 4.2, mongoimport expects import data to be in Extended JSON v2.0 by
default.
Mongoimport only supports data files that are UTF-8 encoded. Using other encodings will produce
errors.
Importing the order is not maintained, to do that we have to use a specific option.
In-memory Databases
The design of a database solution is driven by the application requirements.
Thus, if our application does not need to handle huge amount of data and to scale, both vertically
and horizontally, we can think about adopting different solutions that the ones discussed so far.
In-memory databases may be suitable for applications whose handled data may fit all in the
memory of a single server.
Queries will be in main memory and not in the disk and we will have low latency.
Moreover, there also can be databases that can reside in the memory capacity of a cluster (if
replication is needed). We can distribute the database and using sharding.
If we lose electricity we have to guarantee the reachability of data.
Very fast access to the data, avoiding the bottleneck of the I/O transfers.
Suitable for application that do not need to write continuously to a persistent medium.
Ad-hoc architecture for being aware that its data are resident in memory and for avoiding the data
loss in the event of a power failure (alternative persistency model).
Cache-less architecture: despite traditional disk-based databases, a memory caching system is not
need (everything is already in memory!!!).
Solutions for Data Persistency are the following.
In the following, we show some solutions for avoiding data loss in case of system failure:
 Writing complete database images (called snapshots or checkpoints) to disk files
frequently or in front of a request.
 Writing out transaction/operation records to an append-only disk file (called a transaction
log or journal) and then return the control to the application. It is similar to a commit log
and it can be used to avoid losing information in case of a failure.
 Replicating data to other members of a cluster in memory.
These are the general solutions for data persistency.
The architecture of the solutions are the one we will see now.

TimesTen is a solution produced By Oracle.


It implements a fairly familiar SQL-based relational model.
All data is memory resident.
Persistence is achieved by writing:
- periodic snapshots of memory to disk
- Writing to a disk-bases transaction log following transaction commits.
We can have a return of the control of the application directly after the commit or after
writing the transaction log, there’s an access to the disk in writing to the log in fact.
By default, the disk writes are asynchronous! To ensure ACID transactions the database can be
configured for synchronous writes to a transaction log after each write (performance loss).
This is TimesTen Architecture.
When the database is started, all data is loaded from checkpoint files into main memory (1). The
application interacts with TimesTen via SQL requests that are guaranteed to find all relevant data
inside that main memory (2). Periodically or when required database data is written to checkpoint
files (3). An application commit triggers a write to the transaction log (4), though by default this
write will be asynchronous so that the application will not need to wait on disk. The transaction
log can be used to recover the database in the event of failure (5).
The application can interact with the architecture in memory with SQL.
On the disk we have the checkpoint files in which we make snapshot with a certain frequency.
The transaction log contains only the information that at this timestamp there was a transaction
on this location with this value.
We may decide to update the transaction log after each transaction or in an asynchronous way,
returning the control to the application and then make an eventual consistency.
In this way we do not ensure the ACID transaction, it’s not durable in the time interval between
the transaction commit and the actual write to the file.
If we want to be sure to have durability of transaction, after each transaction we have to write the
commit log.
This means that we will lose in terms of performances.
The periodic checkpoints can be done at a frequency decidable by the administrator.
The checkpoint files are used when we switch off the system, while the transaction log can be
used when we have a failure.

Redis is an in-memory key-value store.


Objects consist mainly of strings and various types of collections of strings, such as lists and hash
maps.
It can be used to create queues of requests extractable by the server and the results may be
inserted too in another queue.
Only primary key lookups are supported (no secondary indexing mechanism allowed).
Redis may exploit a sort of “virtual memory” mechanism for storing an amount of data larger than
the server memory. Old keys are swapped out to the disk and recovered if needed (lost of
performance accessing continually to the disk swapping-in and out). There can be a mechanism to
decide which keys must be swapped-out.
Although Redis was designed to hold all data in memory, it is possible for Redis to operate on
datasets larger than available memory by using its virtual memory feature. When this is enabled,
Redis will “swap out” older key values to a disk file. Should the keys be needed they will be
brought back into memory. This option obviously involves a significant performance overhead,
since some key lookups will result in disk IO.
Redis uses disk files for persistence:
- The Snapshot files store copies of the entire Redis system at a point in time. Snapshots can
be created on demand or can be configured to occur at scheduled intervals or after a
threshold of writes has been reached. A snapshot also occurs when the server is shut
down.
- The Append Only File (AOF) keeps a journal of changes that can be used to “roll forward”
the database from a snapshot in the event of a failure. Configuration options allow the user
to configure writes to the AOF after every operation, at
This is the Redis Architecture:

Redis supports asynchronous master/slave replication.


If performance is very critical and some data loss is acceptable, then a replica can be used as a
backup database and the master configured with minimal disk-based persistence.
The application interacts with Redis through primary key lookups that return “values”—strings,
sets of strings, hashes of strings, and so on (1). The key values will almost always be in memory,
though it is possible to configure Redis with a virtual memory system, in which case key values
may have to be swapped in or out (2). Periodically, Redis may dump a copy of the entire memory
space to disk (3). Additionally, Redis can be configured to write changes to an append-only journal
file either at short intervals or after every operation (4). Finally, Redis may replicate the state of
the master database asynchronously to slave Redis servers (5).
We can have one part of the disk dedicated for swapping (keys not used anymore), one for
snapshots, and we can write in AOF information about transactions.
If performance is very critical and some data loss is acceptable, then a replica can be used as a
backup database.
In case of write operations, we have to implement an eventually consistency.
In this case, the master is configured with minimal disk- based persistence.
Redis may replicate the state of the master database asynchronously to slave Redis servers.
Although Redis was designed from the ground up as in an in-memory database system,
applications may have to wait for IO to complete under the following circumstances:
 If the Append Only File is configured to be written after every operation, then the
application needs to wait for an IO to complete before a modification will return control.
 If Redis virtual memory is configured, then the application may need to wait for a key to be
“swapped in” to memory.

HANA DB is a relational database, product of SAP, which combines an in-memory technology with
a columnar storage solution. It must be installed on an optimized hardware configuration.
Classical row tables are exploited for OLTP operations and are stored in memory.
Columnar tables are often used for also OLAP operations. By default, they are stored in the disk
and loaded in memory by demand. However, the database may be configured to load in memory a
set of columns or entire tables since the beginning.
Data in the row store is guaranteed to be in memory, while data in the column store is by default
loaded on demand. However, specified columns or entire tables may be configured for immediate
loading on database startup.
The persistence architecture of HANA uses the snapshot and journal file pattern found in Redis
and TimesTen. HANA periodically snapshots the state of memory to Savepoint files. These
Savepoints are periodically applied to the master database files.
ACID transactional consistency is enabled by the transaction “redo” log. As with most ACID-
compliant relational databases, this log is written upon transaction commit, which means that
applications will wait on the transaction log IO to complete before the commit can return control.
HANA’s columnar architecture includes an implementation of the write-optimized delta store
pattern. Transactions to columnar tables are buffered in this delta store. Initially, the data is held
in row-oriented format (the L1 delta). Data then moves to the L2 delta store, which is columnar in
orientation but relatively lightly compressed and unsorted. Finally, the data migrates to the main
column store, in which data is highly compressed and sorted.
The architecture is the following:

On start-up, row-based tables and selected column tables are loaded from the database files (1).
Other column tables will be loaded on demand. Reads and writes to row-oriented tables can be
applied directly (2). Updates to column-based tables will be buffered in the delta store (3), initially
to the L1 row-oriented store. Data in the L1 store will consequently be promoted to the L2 store,
and finally to the column store itself (4). Queries against column tables must read from both the
delta store and the main column store (5).
The row store is the relational database. We have snapshots of it in SavePoint.
From this store we make classical read/write operations; it is used for simple queries.
Then the system may be interested into writes to a columnar part, it decides it in fact where to
write data.
We move towards a more compressed and better organized columnar store.
Queries go towards to the three databases. The first will go to the column store, if data is missing
we go to the L2 delta store and if the same we go to L1 delta Store.

Persistence is achieved with:


- Images of memory are copied to save points periodically (6). The save points are merged
with the data files in due course (7).
- When a commit occurs, a transaction record is written to the redo log (8) (on fast SSD).
The Oracle 12c architecture is the following.

Data is maintained in disk files (1) but cached in memory (2).


An OLTP application primarily reads and writes from memory (3).
Any committed transactions are written immediately to the transaction log on disk (4).
Row data is loaded into a columnar representation for use by analytic applications (5).
Any transactions that are committed once the data is loaded into columnar format are recorded in
a journal (6), an append-only file.
We use this, not for recovering in case of failure for which we use transaction log, but in case we
use analytic queries but we don’t have all information in columnar store.
Analytic queries will consult the journal to determine if they need to read updated data from the
row store (7) or to rebuild the columnar structure.

Guidelines for Selecting a Database


Whenever we are asked to design and develop a specific software application, a set of steps must
be performed:
1) Requirements elicitation from customer
2) Requirements definition, both functional and non functional
3) Use case definition (better if with UML diagrams)
4) Identification of the main data structures (including the main procedures) and of the relations
among them (Analysis Workflow)
5) Refinement of 4) (Project Workflow, including DB Design)
6) Implementation and test
7) Deploy

We can also exploit an agile approach.


The role of the Architect/Engineer is to Identify the most suitable:
1) Software/Hardware architecture
2) Programming languages and development environments
3) Database management systems
Everything must be driven by:
1) Requirements
2) Common sense
3) Experience

An Example of Decision Support Systems (DSS).


Description by words:
The main purpose of the system is to monitor comments of users published on the public social
media pages of commercial activities, such as restaurants, hotels, shops and malls, maintained
either by the owners or by the marketing managers.
Main Actors:
• Customer (the owner of the commercial activity or his/her delegate)
• Social Account Manager
• System Administrator
The system has to continuously performs the following operations:
(i) Real-time fetching of comments on posts published by a commercial activity on its public pages
on social media.
(ii) Analysis of such comments, in order to extract the expressed sentiment, i.e., positive polarity,
negative polarity or neutral comments.
(iii) Notification to the social activity owner (or manager) when negative comments is received
and/or negative trends are recognized, in order to, adequately reply.
(iv) Generation of reports and statistics regarding both (i) the sentiment extracted from a specific
share of voice on social media and (ii) the overall reputation of a product and/or the brand of the
company.
The system must allow the users to carry out basic functions such us:
(i) to upload posts, comments, pictures and videos
(ii) to download and save post and comments (setting appropriate search parameters),
(iii) to analyze the trend of the number of likes or followers for a specific page or channel.
The DSS software platform must manage several accounts on social networks, even belonging to
the same social network. The pair (account, social network) is referred to as social channel.
Relational databases are still useful. They do the work for which they were designed: protecting
the data integrity and reducing the risk of anomalies (immediate consistency and ACID
transactions support).
• Relational databases will continue to support transaction processing systems and, in general,
OTPL applications.
• Decades of experience with transaction processing systems and data warehouses has led to best
practices and design principles that continue to meet the needs of businesses, governments, and
other organizations.
On the basis of our requirements, analysis and project workflows, we can choice among:
• Relational databases, such as PostgreSQL and MySQL,
• Key-value databases, such as Redis, Riak, and DinamoDB
• Document databases, such as MongoDB and CouchDB
• Column family databases, such as Cassandra and HBase
• Graph databases, such as Neo4j and Titan
The functional requirements are related to the type of queries that describe how the data will be
used. Moreover, they also influence the nature of relations among entities and the needs of
flexibility in data models.
When choosing a database, we have to analyze other factors (non functional requirements) such
as:
• The volume of reads and writes
• Tolerance for inconsistent data in replicas
• Availability and disaster recovery requirements
• Latency requirements
Key/value databases are preferable when:
• The application has to handle frequent but small read and write operations.
• Simple data models have been designed
• Simple queries are needed (no complex queries and filtering on different fields)
These databases do not allow too much flexibility and the possibility of making queries on
different attributes.
Examples of applications that may exploit these databases:
• Caching data from relational databases to improve performance
• Tracking transient attributes in a web application, such as a shopping cart
• Storing configuration and user data information for mobile applications
• Storing large objects, such as images and audio files
Document databases are suitable for applications that need:
• To store varying attributes, in terms of number and type
• To store and elaborate large amount of data
• To make complex queries (on different types of attribute), to use indexing and to make advanced
filtering on attributes.
Flexibility is the main feature of document databases, along with high performance capabilities
and easy of use.
Embedded documents allow to use few collections of documents that store together attributes
frequently accessed together (denormalization).
Examples of applications that may exploit these databases:
• Back-end support for websites with high volumes of reads and writes
• Managing data types with variable attributes, such as products
• Tracking variable types of metadata
• Applications that use JSON data structures
• Applications benefiting from denormalization by embedding structures within structures
Column databases are suitable for applications that:
• Require the ability to frequently read/write to the database
• Are geographically distributed over multiple data centers
• Can tolerate some short-term inconsistency in replicas
• Manage dynamic fields
• Actually have to deal with huge amount of data, such as hundreds of terabytes
Column Databases are normally deployed on clusters of multiple servers on different data centers.
If data handled by the application is small enough to run with a single server, then avoid these
databases.
Examples of applications that may exploit these databases:
• Security analytics using network traffic and log data mode
• Big Science, such as bioinformatics using genetic and proteomic data
• Stock market analysis using trade data
• Web-scale applications such as search
• Social network services
Graph databases are suitable for applications that:
• Need to represent their domain as networks of connected entities
• Handle instances of entities that have relations to other instances of entities
• Require to rapidly traverse paths between entities.
Some considerations:
• Two orders in an e-commerce application probably have no connection to each other. They
might be ordered by the same customer, but that is a shared attribute, not a connection.
• A game player’s configuration and game state have little to do with other game players’
configurations. Entities like these are readily modeled with key-value, document, or relational
databases.
• Let consider proteins interacting with other proteins or employees working with other
employees. In all of these cases, there is some type of connection, link, or direct relationship
between two instances of entities.
Examples of applications that may exploit these databases:
• Network and IT infrastructure management
• Identity and access management
• Business process management
• Recommending products and services
• Social networking
When choosing a solution for data storage and management, we are not obliged to use just one
type of database management systems.
Since different types of SQL and NoSQL DBMSs have different features, that can be usefully
together in a specific application, they may be used concurrently.
Some use cases:
• Large-scale graph processing, such as with large social networks, may actually use column family
databases for storage and retrieval. Graph operations are built on top with a graph in-memory
database.
• OLTP handled with a Relational Database (for ACID transactions), OLAP in charge to MongoDB
for fast analytics.
We need to handle consistency in any case.
Mongo DB Query
The find() method returns a cursor to the results.
In the mongo shell, if the returned cursor is not assigned to a variable using the var keyword, the
cursor is automatically iterated to access up to the first 20 documents that match the query.
To manually iterate over the results, assign the returned cursor to a variable with the var keyword,
as shown in the following slide.
The mongo shell and the drivers provide several cursor methods that call on the cursor returned
by the find() method to modify its behavior.

We assign to myDocument the first document pointed by the cursor and then the cursor will be
incremented by one. If we actually have a value, we will get the item of the JSON Document.
We transform the content to json and obtain the result.
The result will be journal, then notebook and so on until the end when we will get null.
The limit() method limits the number of documents in the result set (if we don’t want 20 elements
in the result). The following operation returns at most 3 documents in the collection inventory.

The skip() method controls the starting point of the results set. The following operation skips the
first 3 documents in the inventory collection and returns all remaining documents:

We will see the last two objects and the first three objects will be skipped.
The sort() method orders the documents in the result set. By default they will be sorted by _id,
considering the order of insertion. The following operation returns documents in the inventory
collection sorted in ascending order by the qty field:
The following operation returns documents in the inventory collection sorted in descending order
by the qty field:

We can even use more than one field.


The following operation returns documents in the inventory collection sorted in ascending order
by the status field (at first) and in descending order by the qty field:

We will order by status, and with objects with the same status we will have a descending order by
qty.
The previous cursor methods can be combined as follows:

The db.inventory.find().sort({qty:-1}).skip(2) query sorts the collections in descending order


considering the qty field but skips the first two documents.

The method db.collection.count(query, options) returns the count of documents that would match
a find() query for the collection.
The method db.collection.count does not perform the find() operation but instead counts and
returns the number of results that match a query.

In Mongo 5 the db.collection.count has been deprecated. Use db.collection.countDocuments.


They have few differences.
Query on Embedded/Nested Documents
We can use use the query filter document { <field>: <value> } where <value> is the document to
match.
For example, the following query selects all documents where the field size equals the document
{ h: 14, w: 21, uom: "cm" }:

We put inside the argument the document itself.


We have one document in which a field is exactly equal to a document.
To specify a query condition on fields in an embedded/nested document, use dot notation
("field.nestedField").
The following example selects all documents where the field uom, nested in the size field, equals
"in":

The following query uses the less than operator ($lt) on the field “h” embedded in the “size” field:

The following query selects all documents where the nested field “h” is less than 15, the nested
field “uom” equals "in", and the “status” field equals "D":

Query an Array
To specify equality condition on an array, use the query document { <field>: <value> } where
<value> is the exact array to match, including the order of the elements.
The following example queries for all documents where the field “tags” value is an array with
exactly two elements, "red" and "blank", in the specified order:

If we wish to find an array that contains both the elements "red" and "blank", without regard to
order or if there are other elements in the array, use the $all operator:

To query if the array field contains at least one element with the specified value, use the filter
{ <field>:<value> } where <value> is the element value.
The following example queries for all documents where “tags“ is an array that contains the string
"red" as one of its elements:
To specify conditions on the elements in the array field, use query operators in the query filter
document { <array field>: { <operator1>: <value1>, ... } }
The following operations queries for all documents where the array ”dim_cm” contains at least
one element whose value is greater than 25 (first query) and is greater than 20 (second query).

The following example, queries for documents where the “dim_cm” array contains elements that
in some combination satisfy the query conditions; e.g., one element can satisfy the greater than 15
condition and another element can satisfy the less than 20 condition, or a single element can
satisfy both:

The following example queries for documents where the “dim_cm” array contains at least one
element that is both greater than ($gt) 22 and less than ($lt) 30:

Using dot notation, we can specify query conditions for an element at a particular index or
position of the array. The array uses zero-based indexing.
The following example queries for all documents where the second element in the array “dim_cm”
is greater than 25:

Use the $size operator to query for arrays by number of elements. For example, the following
selects documents where the array “tags” has 3 elements.
1.Import the restaurants datasets from the restaurants.json file.

From the command line


mongoimport -c=restaurants -d=exer --file=restaurants.json

2.Write a MongoDB query to count the total number of documents.

db.restaurants.find().count();

3.Write a MongoDB query to display all the restaurant which is in the


borough Bronx

db.restaurants.find({"borough": "Bronx"});

4. Write a MongoDB query to display the next 5 restaurants after skipping


first 5 which are in the borough Bronx

db.restaurants.find({"borough": "Bronx"}).skip(5).limit(5);

5. Write a MongoDB query to find the restaurants that achieved a score,


more than 80 but less than 100

db.restaurants.find({grades : { $elemMatch:{"score":{$gt : 80 ,
$lt :100}}}});

6.Write a MongoDB query to find the restaurants which locate in a latitude


value less than -95.754168.

db.restaurants.find({"address.coord" : {$lt : -95.754168}});

7.Write a MongoDB query to find the restaurants that do not prepare any
cuisine of 'American' and their grade score more than 70 and latitude less
than -65.754168

db.restaurants.find(
{$and:[{"cuisine" : {$ne :"American "}},
{"grades.score" : {$gt : 70}},
{"address.coord" : {$lt : -65.754168}}
]
}
);
8. Write a MongoDB query to find the restaurants which do not prepare any
cuisine of 'American ' and achieved a grade point 'A' not belongs to the
borough Brooklyn.
The document must be displayed according to the cuisine in descending order.

db.restaurants.find( {
"cuisine" : {$ne : "American "},
"grades.grade" :"A",
"borough": {$ne : "Brooklyn"}
}
).sort({"cuisine":-1});
Projection operations and Aggregations in MongoDB
By default, queries in MongoDB return all fields in matching documents.
To limit the amount of data that MongoDB sends to applications, we can include a projection
document to specify or restrict fields to return.
A projection can explicitly include several fields by setting the <field> to 1 in the projection
document.
In the following example, the operation returns all documents that match the query, limited to
fields item and status.

Instead of listing the fields to return in the matching document, you can use a projection to
exclude specific fields.
The following example which returns all fields except for the status and the instock fields in the
matching documents:

With the exception of the _id field, we cannot combine inclusion and exclusion statements in
projection documents. We have to decide a list we want to include (1) or we want to exclude (0).

We can return specific fields in an embedded document.


Use the dot notation to refer to the embedded field and set to 1 in the projection document.

We can suppress specific fields in an embedded document.


Use the dot notation to refer to the embedded field in the projection document and set to 0.
We can extract some elements of an array inside an embedded document.
The following example specifies a projection to return:
• The _id field (returned by default),
• The item field,
• The status field,
• The qty field in the documents embedded in the instock array.

For fields that contain arrays, MongoDB provides the following projection operators for
manipulating arrays: $elemMatch, $slice, and $.
$slice allow us to specify the element that we want (-1 = last).

Specifying 2 we will have the first two elements and the same for the other positive numbers and
instead using -2 we will have the last two elements and the same for other negative numbers.
The $elemMatch operator project the first matching element from an array based on a condition.

We won’t skip the first two objects because the parameter of the query is qty and they don’t have
it.
The { item : null } query matches documents that either contain the item field whose value is null
or that do not contain the item field.

The first document has the item field but with no value, this is possible.
It is possible to query documents that contains field of a specific type.

Type 2 means String, while 10 means null.


The following examples query for documents that contain/do not contain a specific field:

Aggregation
Aggregation operations process data records and return computed results.
Aggregation operations group values from multiple documents together and can perform a variety
of operations on the grouped data to return a single result.
For example, we can calculate statistics on distribution.
MongoDB quickly group documents and calculate statistics.
MongoDB provides three ways to perform aggregation:
• the aggregation pipeline, we consider only this
• the map-reduce function
• single purpose aggregation methods.

Documents enter a multi-stage pipeline that transforms the documents into aggregated results.
For example:

First Stage: The $match stage filters the documents by the status field and passes to the
next stage those documents that have status equal to "A".
Second Stage: The $group stage groups the documents by the cust_id field to calculate
the sum of the amount for each unique cust_id.
The id of the result will be cust_id and for each we calculate the sum of the amount.
The MongoDB aggregation pipeline consists of stages. Each stage transforms the documents as
they pass through the pipeline.
Pipeline stages do not need to produce one output document for every input document; e.g.,
some stages may generate new documents or filter out documents.
Pipeline stages can appear multiple times in the pipeline except for $out, $merge, and $geoNear
stages.

The $lookup retrieve information from another collection. If we use it, it must be justified.
If we consider more than one collection there is a warning, three or more is not allowed for sure.
Unwind must be justified too, starting from the array we’re extracting information and we need to
explain why we used array and not a specific collection.

Let’s see an example.


Each document in the zipcodes collection has the following form:

Return States with Populations above 10 Million:


The $group stage groups the documents of the zipcode collection by the state field, calculates the
totalPop field for each state, and outputs a document for each unique state.

The new per-state documents have two fields: the _id field and the totalPop field.
Then we have the $match stage that filters the grouped documents to output only those
documents
whose totalPop value is greater than or equal to 10 million.
The $match stage does not alter the matching documents but outputs the matching documents
unmodified.

Return average city population by State:

The aggregation pipeline consists of a $group stage followed by another $group stage.
We have to consider the population of each city and then calculate the average step by step.
The first $group stage groups the documents by the combination of city and state, uses the $sum
expression to calculate the population for each combination, and outputs a document for each city
and state combination.
The second $group stage groups the documents in the pipeline by the “_id.state” field (i.e. the
state field inside the _id document), uses the $avg expression to calculate the average city
population (avgCityPop) for each state, and outputs a document for each state.
Return the largest and the smallest cities by State.
To solve this query we build a pipeline which consists of a $group stage, a $sort stage, another
<$group stage, and a $project stage.
Let consider the first two stages:

The first $group stage groups the documents by the combination of the city and state, calculates
the sum of the pop values for each combination, and outputs a document for each city and state
combination.
The $sort stage orders the documents in the pipeline by the pop field value, from smallest to
largest; i.e. by increasing order. This operation does not alter the documents.
Let’s consider also the second $group stage in the pipeline:
We can explore the fact that we sorted cities in the previous stage.
The final $project stage: i) renames the _id field to state and ii) moves the biggestCity, biggestPop,
smallestCity, and smallestPop into biggestCity and smallestCity embedded documents.

Indexes
Indexes support the efficient execution of queries in MongoDB.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a
collection, to select those documents that match the query statement.
If an appropriate index exists for a query, MongoDB can use the index to limit the number of
documents it must inspect.
Indexes are special data structures that store a small portion of the collection’s data set in an easy
to traverse form.
The index stores the value of a specific field or set of fields, ordered by the value of the field.
The ordering of the index entries supports efficient equality matches and range-based query
operations.
In addition, MongoDB can return sorted results by using the ordering in the index.

MongoDB defines indexes at the collection level and supports indexes on any field or sub-field of
the documents in a MongoDB collection.
We have a collection of keys which are values associated with the indexes.
Rather than scanning all documents, we will scan the keys inside the index and only the ones less
than 30.
There’s a default index, which is the _id index.
MongoDB creates a unique index on the _id field during the creation of a collection.
A unique index ensures that the indexed fields do not store duplicate values.
The _id index prevents clients from inserting two documents with the same value for the _id field.
The index on the _id field cannot be dropped.
The following method can be used for creating an index (we need to have a collection and inside
at least one document):
db.collection.createIndex(keys, options)
keys -> a document that contains the field and value pairs where:
 the field is the index key and
 the value describes the type of index for that field. For an ascending index on a field,
specify a value of 1; for descending index, specify a value of -1.
options-> a document which contains a set of options that controls the creation of the index.
The method only creates an index if an index of the same specification does not already exist.
For a single-field index and sort operations, the sort order (i.e. ascending or descending) of the
index key does not matter.
We can view index names using the db.collection.getIndexes() method.
We cannot rename an index once created. Instead, we must drop and re- create the index with a
new name.

We can have a compound index in which we consider multiple fields.


The order of fields listed in a compound index has significance.

We will have a data structure in which we will have the userId in ascending order and among the
same userid we will have the score ordered in descending order.

Let’s see how sorting with compound indexes work.


We can specify a sort query on all the keys of the index or on a subset.
However, the sort keys must be listed in the same order as they appear in the index.
For example, an index key pattern { a: 1, b: 1 } can support a sort on { a: 1, b: 1 } but not on { b: 1,
a: 1 }.
We will obtain results but not exploiting the index.
It’s important to follow the same order used when defining the index.
For a query to use a compound index for a sort, the specified sort direction for all keys must match
the index key pattern or match the inverse of the index key pattern.
For example, an index key pattern { a: 1, b: -1 } can support a sort on { a: 1, b: -1 } and { a: -1, b: 1 }
but not on { a: -1, b: -1 } or {a: 1, b: 1}.
We won’t exploit indexes if we go towards a different direction.

Indexes can be created on keys in embedded documents in the same way that they are created on
normal keys.
Let’s suppose to have a collection of documents like the following one:

We can define an index on the subfield ”city” as follows:


db.users.createIndex({"loc.city" : 1})
Indexes are good but need to be updated after each write operations.
What happens if for the previous example we define and index as follows
db.users.createIndex({loc : 1}) ?
We would be indexing the entire subdocument, it’s useful if we need to retrieve an exact matching
on the subdocument.
Indexing the entire subdocument will only help queries that are querying for the entire
subdocument, such as
db.users.find({"loc" : {"ip" : "123.456.789.000", "city" : "Shelby ville", "state" : "NY"}}})).
We won’t exploit it in the following case.
The following query: db.users.find({"loc.city" : "Shelbyville"}) will be not supported by the the
index on the entire document.

Regarding the indexing of arrays, let’s suppose to have a collection of posts extracted from a
social network. Each post includes a set of comments modelled as an array of documents:

We are not sure that comments will be ordered by date.


We can create an index on the date subfield of the field comments as follows:
db.blog.createIndex({"comments.date" : 1})
Notice that indexing an array creates an index entry for each element of the array, so if a post had
20 comments, it would have 20 index entries.
We won’t create an entry into the index, but for each element we would create an index. If we
create an index on one field inside a vector, for each one of the document in the field I would
create an entry.
This makes array indexes more expensive than single-value ones: for a single insert, update, or
remove, every array entry of the index might have to be updated.
If we delete something we need update all the entry of a specific document.
If we want to define an index, the amount of write operations must be a low part of the overall
operations, especially in case of embedded documents.
Consider a collection composed by documents like the following one:
{ _id: 1, item: "ABC", ratings: [ 2, 5, 9 ] }
If we create this index: db.survey.createIndex( { ratings: 1 } ). The index contains three keys, each
pointing to the same document.
Indexes on array elements do not keep any notion of position: we cannot use an index for a query
that is looking for a specific array element.

MongoDB offers the possibility of defining the following advanced indexes:


 Geospatial Index: supports efficient queries of geospatial coordinate data. MongoDB
provides two special indexes: 2d indexes that uses planar geometry when returning results
and 2dsphere indexes that use spherical geometry to return results.
 Text Index: MongoDB provides text indexes to support text search queries on string
content. Text indexes can include any field whose value is a string or an array of string
elements. It can be used to retrieve similar comments.
 Hash index: supports hash based sharding and indexes the hash of the value of a field. Only
support equality matches and cannot support range-based queries. It can be used to
control the load on the server.
 Unique Indexes: The unique property for an index causes MongoDB to reject duplicate
values for the indexed field, like the _id index. It ensures the absence of duplicated
documents for that field.
 Partial Indexes: only add an entry in the index of the documents in a collection that meet a
specified filter expression. By indexing a subset of the documents in a collection, partial
indexes have lower storage requirements and reduced performance costs for index
creation and maintenance.
 Sparse Indexes: the sparse property of an index ensures that the index only contain entries
for documents that have the indexed field. The index skips documents that do not have the
indexed field.
 TTL Indexes: are special indexes that MongoDB can use to automatically remove
documents from a collection after a certain amount of time. This is ideal for certain types
of information like machine generated event data, logs, and session information that only
need to persist in a database for a finite amount of time.
Evaluate the performance of a query with/without indexes

To evaluate the performance of a query chain, the execution time and how many documents
examinated in the query the cursor.explain("executionStats") cursor method to the end of the
find command as follows:
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } ).explain("executionStats") which returns the
following document:

We have the # of objects returned, the totalKeysExamined (we have no indexes so 0) and total
documents examinated.
Let define the following index: db.inventory.createIndex( { quantity: 1 } )
Let use the command:
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } ).explain("executionStats") which returns the
following document:
When we use the index we scan 3 keys and only 3 documents. This can prove that our index will
be good together with the proof of the amount of read/write operations.
Consider the following query:
db.inventory.find( { quantity: { $gte: 100, $lte: 300 }, type: "food" } ) which returns:
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 5, "item" : "f3", "type" : "food", "quantity" : 300 }
We can define the following two compound indexes:
db.inventory.createIndex( { quantity: 1, type: 1 } )
db.inventory.createIndex( { type: 1, quantity: 1 } )
With compound indexes, the order of the fields matter.
 The first index orders by quantity field first, and then the type field.
 The second index orders by type first, and then the quantity field.
We can define both and perform the query, MongoDB will automatically use the best one. It will
give us the winningPlan and the rejected one, using the explain command.
It depends on the distribution of the data.
We can also use the hint() method to force MongoDB to use a specific index for the query to
understand how long it takes to execute on one index.
Let use the command:
db.inventory.find( { quantity: { $gte: 100, $lte: 300 }, type: "food" } ).hint({ quantity: 1, type:
1 }).explain("executionStats"))
which returns the following document:
MongoDB Java Driver
This driver allows us to manipulate using Java Application data stored in a MongoDB database.
The recommended way to get started using one of the drivers in your project is with a
dependency management system, such as MAVEN.
It’s possible to work with the Java MongoDB Driver importing the following dependency in the
MAVEN file:

The MongoClients.create(), allows us to make a connection to a running MongoDB instance.


A simple way to connect to a MongoDB server is: MongoClients.create(connectionString)
E.g. MongoClient myClient = MongoClients.create("mongodb://localhost:27017”);
The MongoClient instance represents a pool of connections to the database; we will only need one
instance of class MongoClient even with multiple threads.
IMPORTANT: Mongoclient is thread-safe. Multiple access to the single instance is managed by the
class itself
Typically, for simple projects, we only create one MongoClient instance for a given MongoDB
deployment (e.g. standalone, replica set, or a sharded cluster) and use it across the whole
application.
Call MongoClient.close() to clean up resources at the end of the application.
Once we have a MongoClient instance connected to a MongoDB deployment, we can use the
MongoClient.getDatabase() method to access a database.
Specify the name of the database to the getDatabase() method. If a database does not exist,
MongoDB creates the database when you first store data for that database.
The following example accesses the mydb database:
MongoDatabase database = mongoClient.getDatabase("mydb");

Once we have a MongoDatabase instance, we can use its getCollection() method to access a
collection.
Specify the name of the collection to the getCollection() method. If a collection does not exist,
MongoDB creates the collection when you first store data for that collection.
For example, using the database instance, the following statement accesses the collection named
test in the mydb database:
MongoCollection<Document> collection = database.getCollection("test");
In the collection, Documents are BSON Documents but we can also use JSON Documents.

We have two options to create a Document.


To create the document using the Java driver, we can use the Document class. For example,
consider the following JSON document:
The first option allows us to specify a field and then appending other fields.
Otherwise, you can use a string that represent the json file (Be careful to escape double quotes!)
Document doc = Document.parse("{name:\"Gionatan\", surname:\"Gallo\"}");
The second option allows us to parse the JSON of the document.

To insert a single document into the collection, we can use the collection’s insertOne() method.
collection.insertOne(doc);
To insert a set of documents, contained into a list of documents we can use the following example
of code:

To query a collection, we can use the collection’s find() method.


We can call the method without any arguments to query all documents in a collection or pass a
filter to query for documents that match the filter criteria.
The following example retrieves all documents in the collection and prints the returned
documents:

We can iterate through query results by using a consumer function (statically or locally defined)
The foreach applies printDocuments to all results of the query.
Collect results in a list:

To query for documents that match certain conditions, pass a filter object to the find() method.
To facilitate creating filter objects, Java driver provides the Filters helper.
The following example retrieves all documents in the collection where 50 < i <= 100 and prints the
returned documents:

The and operator returns a BSON Document, useful if we don’t remember precisely the structure
of the document to use as a filter.

To update documents in a collection, we can use the collection’s updateOne and updateMany
methods.
These functions needs two parameters:
A filter object to determine the document or documents to update.
An update document that specifies the modifications.

updateOne() updates only the first.


The following example updates the first document that meets the filter i equals 10 and sets the
value of i to 110:
collection.updateOne(eq("i", 10), set(“i", 110));
The following example increments the value of i by 100 for all documents where the value of field
i is less than 100:
UpdateResult updateResult = collection.updateMany(lt("i", 100), inc("i", 100));
System.out.println(updateResult.getModifiedCount());
The update methods return an UpdateResult which provides information about the operation
including the number of documents modified by the update.

To delete documents from a collection, we can use the collection’s deleteOne and deleteMany
methods.
As a parameter it requires just a filter object to select the documents to delete.
The example deletes at most one document that meets the filter i equals 110:
collection.deleteOne(eq("i", 110));
The following example deletes all documents where i is greater or equal to 100:
DeleteResult deleteResult = collection.deleteMany(gte("i", 100));
System.out.println(deleteResult.getDeletedCount());
The delete methods return a DeleteResult which provides information about the operation
including the number of documents deleted.

To perform aggregation, pass a list of aggregation stages to the MongoCollection.aggregate()


method.

Import static filters, aggregations, projections and accumulators to improve readability of your
code.
Using static import allows us to do not use package names when using sum, aggregate, etc.

The match operator filters the documents to pass only the ones that match the specified
condition(s) to the next pipeline stage.
The group operator groups input documents by the specified _id expression (first argument) and
for each distinct grouping. It returns a document.
This operator can include accumulators (sum, avg, max, min, ...)

For multiple indexes it is better to define directly a document

If we want to group by using multiple fields the second option has to be used, we cannot use the
utility.

The project operator passes along the documents with the requested fields to the next stage in
the pipeline. The specified fields can be existing fields from the input documents or newly
computed fields. I can exclude, include or compute new fields.

Sometimes we may want to rename the _id field in another name, because the name of the
variable for which it has been grouped won’t be shown but a new ‘_id’ field will by default.
The sort operator Sorts all input documents and returns them to the pipeline in sorted order.
The order can be ascending or descending.

The limit operator limits the number of documents passed to the next stage. In conjunction with
limit, we can implement queries that search, for example, for the top 3 biggest cities
The skip operator, instead, skips over the specified number of documents that pass into the next
stage.
The unwind operator deconstructs an array field from the input documents to output a document
for each element Each output document is the input document with the value of the array field
replaced by the element.

Given a document with an array inside, after the unwind stage, we will have a document for each
value of the array with all other fields equal to n the original document.

To create an index on a field or fields, pass an index specification document to the createIndex()
method. An index key specification document contains the fields to index and the index type for
each field: new Document(<field1>, <type1>).append(<field2>, <type2>) ...
For an ascending index type, specify +1 for <type>. For a descending index type, specify -1 for
<type>. The following example creates an ascending index on the i field:
collection.createIndex(new Document("city", 1));

Exercise)
ZIPS dataset
1. Find zip codes of Texan cities.
2. Find cities' zips with a population of at least 100’000, but no more than
200’000.
3. Find the 5 most populated cities.
4. For each state, find the average population of its cities.
POSTS dataset
1. Find posts published after 2012-11-20.
2. Find posts with the tag 'computer'.
3. Find, for each tag, the total number of posts.
4. Find the top three commentators according to the number of comments.
5. Find the most versatile commentator. Versatile means that he/she
commented on the highest number of distinct topics (a.k.a. tags). For a tag
to count, he/she must have at least five comments about that topic.

package main.java.exercises;

import com.mongodb.ConnectionString;
import com.mongodb.client.*;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Filters;
import com.mongodb.client.result.DeleteResult;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;
import org.bson.json.JsonWriterSettings;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.function.Consumer;

import static com.mongodb.client.model.Filters.*;


import static com.mongodb.client.model.Updates.*;
import static com.mongodb.client.model.Aggregates.*;
import static com.mongodb.client.model.Projections.*;
import static com.mongodb.client.model.Accumulators.*;
import static com.mongodb.client.model.Accumulators.addToSet;
import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Sorts.descending;

import org.bson.conversions.Bson;

public class MongoDBJavaEx


{
private static Consumer<Document> printDocuments() {
return doc -> System.out.println(doc.toJson());
}

private static Consumer<Document> printFormattedDocuments() {


return doc ->
System.out.println(doc.toJson(JsonWriterSettings.builder().indent(true).buil
d()));
}

private static void getTop3PopulatedCitiesInTexas(MongoDatabase db) {


MongoCollection<Document> zips = db.getCollection("zips");

Bson match = match(eq("state", "TX"));


Bson group = group("$city", sum("totalPop", "$pop"));
Bson project = project(fields(excludeId(), include("totalPop"),
computed("city", "$_id")));
Bson sort = sort(descending("totalPop"));
Bson limit = limit(3);

zips.aggregate(Arrays.asList(match, group, project, sort,


limit)).forEach(printDocuments());
//List<Document> results = zips.aggregate(Arrays.asList(match,
group, project, sort, limit)).into(new ArrayList<>());
System.out.println("==> 3 most densely populated cities in Texas");
//results.forEach(printFormattedDocuments());
}

private static void getTop3Tags(MongoDatabase db) {

MongoCollection<Document> posts = db.getCollection("posts");

Bson unwind = unwind("$tags");


Bson group = group("$tags", sum("count", 1L),
Accumulators.push("titles", "$title"));
Bson sort = sort(descending("count"));
Bson limit = limit(3);
Bson project = project(fields(excludeId(), computed("tag", "$_id"),
include("count", "titles")));

//posts.aggregate(Arrays.asList(unwind,
limit(10))).forEach(printDocuments());
List<Document> results = posts.aggregate(Arrays.asList(unwind,
group, sort, limit, project)).into(new ArrayList<>());
System.out.println("==> 3 most popular tags and their posts
titles");
results.forEach(printFormattedDocuments());

private static void tagsExercises(MongoDatabase db)


{
MongoCollection<Document> posts = db.getCollection("posts");

//1 - Find Posts published after 2012-11-20


System.out.println("Query 1");
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd",
Locale.ENGLISH);
Date myDate = null;
try {
myDate = sdf.parse("2012-11-20");
} catch (ParseException e) { e.printStackTrace(); }
posts.find(gte("date", myDate)).forEach(printDocuments());

//2 - Find Posts with the tag 'computer'


System.out.println("Query 2");
posts.find(eq("tags", "computer")).forEach(printDocuments());

//3 - Find, for each tag, the total number of posts


System.out.println("Query 3");
Bson unwind = unwind("$tags");
Bson group = group("$tags", sum("count", 1L));
Bson project = project(fields(excludeId(), computed("tag", "$_id"),
include("count")));
Bson limit = limit(10);

posts.aggregate(Arrays.asList(unwind, group, project, limit))


.forEach(printDocuments());

//4 - Find the top 3 commentators according to the number of


comments
System.out.println("Query 4");
unwind = unwind("$comments");
group = group("$comments.author", sum("numPosts", 1L));
project = project(fields(excludeId(), computed("commentator",
"$_id"), include("numPosts")));
Bson sort = sort(descending("numPosts"));
limit = limit(3);

posts.aggregate(Arrays.asList(unwind, group, project, sort, limit))


.forEach(printDocuments());

//5 - Find the most versatile commentator


System.out.println("Query 5");
Bson unwind1 = unwind("$comments");
Bson unwind2 = unwind("$tags");
Bson group1 = new Document("$group",
new Document("_id", new Document("commentator",
"$comments.author").append("tag", "$tags"))
.append("numPosts", new Document("$sum", 1L)));
Bson match = match(gte("numPosts", 5));
Bson group2 = group("$_id.commentator", sum("numDistinctTags", 1L),
addToSet("tags", "$_id.tag"));
sort = sort(descending("numDistinctTags"));
project = project(fields(excludeId(), computed("commentator",
"$_id"), include("tags", "numDistinctTags")));

Document mostVersatileCommentator = posts.aggregate(


Arrays.asList(unwind1, unwind2, group1, match, group2, sort,
project)).first();
System.out.println(mostVersatileCommentator.toJson());

private static void zipExercises(MongoDatabase db)


{
MongoCollection<Document> zips = db.getCollection("zips");

//1 - Find Texan cities


//zips.find(eq("state", "TX")).forEach(printDocuments());

//2 - Find cities' zips with a population of at least 100'000


// but no more than 200'000
zips.find(and(gte("pop",100000), lte("pop",200000)))
.forEach(printDocuments());

//3 - Find the 5 most populated cities


Bson group = group("$city", sum("totalPop", "$pop"));
Bson project = project(fields(excludeId(), computed("city", "$_id"),
include("totalPop")));
Bson sort = sort(descending("totalPop"));
Bson limit = limit(5);

zips.aggregate(Arrays.asList(group, project, sort, limit))


.forEach(printDocuments());

//4 - For each state, find the average population of its cities
//
Bson group1 = new Document("$group", new Document("_id", new
Document("city", "$city")
.append("state", "$state"))
.append("totalPop", new Document("$sum", "$pop")));
Bson project1 = project(fields(excludeId(), computed("city",
"$_id.city"),
computed("state", "$_id.state"), include("totalPop")));
Bson group2 = group("$state", avg("avgStatePop", "$totalPop"));
Bson project2 = project(fields(excludeId(), computed("state",
"$_id"), include("avgStatePop")));

zips.aggregate(Arrays.asList(group1, project1, group2, project2))


.forEach(printDocuments());
}

public static void main(String[] args)


{
//---Connect to the MongoDB---
ConnectionString uri = new ConnectionString(

"mongodb+srv://admin:admin@largescaleandmultistruc.jhdlw.mongodb.net/
sample_training?retryWrites=true&w=majority");

MongoClient mongoClient = MongoClients.create(uri);


MongoDatabase db = mongoClient.getDatabase("sample_training");
//getTop3PopulatedCitiesInTexas(db);
//getTop3Tags(db);
//zipExercises(db);
tagsExercises(db);
}
}

MongoDB Clustering
We can also create a cluster made of different MongoDB servers, and all replications, updates will
be done automatically.
We can use the cluster as a single entity, as a standalone representation.
We can have a cluster of different instances in the same machine with different ports.
In Ubuntu:
mkdir ~/data/r{1,2,3}
Open 3 separate terminals and start 3 MongoDB servers on replica set mode (--replSet option):
mongod --replSet lsmdb --dbpath ~/data/r1 –port 27018 --bind_ip localhost --oplogSize 200
mongod --replSet lsmdb --dbpath ~/data/r2 –port 27019 --bind_ip localhost --oplogSize 200
mongod --replSet lsmdb --dbpath ~/data/r3 –port 27020 --bind_ip localhost --oplogSize 200
They will be running but not yet forming a cluster, they don’t know how to reach other members.
To fully configure the replica set, follow these steps:
1. Open a 4th terminal and connect to one of the three instances, for example, the first one
mongo --port 27018
2. Define the following configuration:

3. Initiate the replica set (only ONCE)


rs.initiate(rsconf);
4. Check the status to see which is the primary node or nodes’ status:
rs.status();
5. Change configuration specifying the other structure with the new configuration:
rs.reconfig(rsconfig);
We can have additional configurations as follows.
Set different priorities to replica set members in order to change the default behavior during the
primary election. (by default priority 1 for all of them)
{_id: 0, host: "localhost:27018", priority: 0.5}
{_id: 1, host: "localhost:27019", priority: 1},
{_id: 2, host: "localhost:27020", priority: 3}]
Set tags for different members:
{_id:0, host:"localhost:27018", tags:{dc:”east”}},
{_id:1, host:"localhost:27019", tags:{dc:”mid”}},
{_id:2, host:"localhost:27020", tags:{dc:”west”}}]
Used for example to give preferences to the Java Driver, for example saying that we can prefer this
tag (for example a close one).
Set different priorities to replica set members in order to change the default behavior during the
primary election

- W is the minimum number of writes that the application waits until it regains the flow of
execution. If W = 2 when the application performs a write the ack from the cluster will not
arrive until at least two nodes will perform the write (not only the primary).
The default value is majority (n/2+1 if odd or n/2 if even), but it can be any integer >= 1.
The higher the value, the higher the waiting time but we will have a higher consistency.
- Wtimeout is the maximum waiting time (in ms) before regaining the flow of execution. If
set to 0 the application will wait indefinitely or until “w” is satisfied.
If the primary shutdowns a new primary will be elected.
They will search periodically if the third member has come back up again.

Configure a distributed Replica Set


First, we need to connect through a VPN, install mongoDB on each VM and start and configure the
servers in a similar fashion as before.

Connect via VPN using the largescale.ovpn file and through ssh use user=root and pass=root.
Open 3 different terminals and issues the following commands
#Terminal 1
ssh root@172.16.4.99
mkdir data
mongod --replSet lsmdb --dbpath ~/data –port 27020 --bind_ip localhost,172.16.4.99 –oplogSize
200
#Terminal 2
ssh root@172.16.4.100
mkdir data
mongod --replSet lsmdb --dbpath ~/data –port 27020 --bind_ip localhost,172.16.4.100 –oplogSize
200
#Terminal 3
ssh root@172.16.4.101
mkdir data
mongod --replSet lsmdb --dbpath ~/data –port 27020 --bind_ip localhost,172.16.4.101 –oplogSize
200

Connect to one of the VM through ssh (open a 4th terminal)

The cluster will now be a single entity.


We can access the replica set from Java Driver:
In the Connection String we specify all the member of clusters, but it’s not mandatory, at least we
need to specify one member of the cluster.
If a write fails we tell the clusters to automatically retry them.
We can create mongoClient settings as second method.
We can specify the write concern at the application level and the one we will have will be W = 1 or
majority?
W = majority is specified here, W = 1 is specified at the beginning.
The one will be used is W = majority, because if the user specifies that he defines W = 3, because it
is the application that waits for the writes so he can decide on it.

The configuration at the application level overwrites the previous.


The Write Concern is specified at the level of the cluster or for that connection I have the Client
concern.
We can also create a database with a specific write concern or also regarding the collections.
A read preference provides client applications with control over which nodes of a replica set are
used for reads.
A client application defines its read preference by selecting one of the five behavioral modes:

Only the primary will return the readed value in the first case and so on.

They ca be specified at the db or collection level too.

We can use the tags configured into our replicas to state preferences on secondary replicas
(cannot specify a tag preference on a primary)
Exercise)
package main.java;

import com.mongodb.*;
import com.mongodb.client.*;
import com.mongodb.client.model.Updates;
import com.mongodb.client.result.DeleteResult;
import com.mongodb.client.result.UpdateResult;
import com.mongodb.connection.ClusterDescription;
import org.bson.Document;
import org.bson.json.JsonWriterSettings;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.function.Consumer;

import static com.mongodb.client.model.Filters.*;


import static com.mongodb.client.model.Updates.*;

public class MongoDBJavaReplicas


{
private static Consumer<Document> printDocuments() {
return doc ->
System.out.println(doc.toJson(JsonWriterSettings.builder().indent(true).buil
d()));
}

private static void populateDB(MongoCollection<Document> myColl)


{
Document student = new Document("name", "Laura")
.append("age", 25)
.append("gender", "F")
.append("grades", Arrays.asList(
new Document("mark", 25).append("DateOfExam", new
Date()).append("name", "PerformanceEvaluation"),
new Document("mark", 30).append("DateOfExam", new
Date()).append("name", "ComputerArchitecture"),
new Document("mark", 28).append("DateOfExam", new
Date()).append("name", "LargeScale")
))
.append("location", new Document("x", 203).append("y",
102));
myColl.insertOne(student);

// 2 -Insert multiple documents


List<Document> documents = new ArrayList<>();
List<String> names = Arrays.asList("Gionatan", "Luigi", "Marco",
"Federico", "Paolo");
for(String name: names)
{
int markPE = (int)((Math.random() * (30 - 18)) + 18);
int markCA = (int)((Math.random() * (30 - 18)) + 18);
int markLS = (int)((Math.random() * (30 - 18)) + 18);
student = new Document("name", name)
.append("age", 25 + (int)((Math.random() * 5) - 2))
.append("gender", "M")
.append("grades", Arrays asList(
new Document("mark", markPE).append("DateOfExam", new
Date()).append("name", "PerformanceEvaluation"),
new Document("mark", markCA).append("DateOfExam", new
Date()).append("name", "ComputerArchitecture"),
new Document("mark", markLS).append("DateOfExam", new
Date()).append("name", "LargeScale")
))
.append("location", new Document("x", 203).append("y", 102));
documents.add(student);
}
myColl.insertMany(documents);
}

public static void main(String[] args)


{

ConnectionString uri = new ConnectionString(

"mongodb+srv://admin:<password>@largescaleandmultistruc.jhdlw.mongodb.net/
<dbname>?retryWrites=true&w=majority");
MongoClient mongoClientAtlas = MongoClients.create(uri);

// Method 1
MongoClient mongoClient = MongoClients.create(
"mongodb://localhost:27018,localhost:27019,localhost:27020/"
+ "?retryWrites=true&w=majority&wtimeout=10000");

// Method 2
// You can specify one or more nodes in the replica set.
// MongoDB will automatically discover PRIMARY and SECONDARY nodes
within the cluster
uri = new ConnectionString("mongodb://localhost:27018");
MongoClientSettings mcs = MongoClientSettings.builder()
.applyConnectionString(uri)
.readPreference(ReadPreference.nearest())
.retryWrites(true)
.writeConcern(WriteConcern.ACKNOWLEDGED).build();
MongoClient mongoClient2 = MongoClients.create(mcs);

/*
// Read Preferences at client level
mongoClient = MongoClients.create(
"mongodb://localhost:27018/?readPreference=secondary");
*/
// Read Preferences at DB level
MongoDatabase db = mongoClient.getDatabase("LSMDB")
.withReadPreference(ReadPreference.secondary());

// Read Preferences at collection level


MongoCollection<Document> myColl = db.getCollection("students")
.withReadPreference(ReadPreference.secondary());

// Write concern at client level


mongoClient = MongoClients.create(
"mongodb://localhost:27018/?w=2&wtimeout=5000");

// Write concern at DB level


db = mongoClient.getDatabase("LSMDB")
.withWriteConcern(WriteConcern.W2);

// Tagged read preferences


Tag west_tag = new Tag("dc", "west");
ReadPreference tagged_pref =
ReadPreference.secondaryPreferred(new TagSet(west_tag));
myColl = db.getCollection("students")
.withReadPreference(tagged_pref);

myColl.find().forEach(printDocuments());

//populateDB(myColl);

//---Count # of documents in a collection---


System.out.println(myColl.countDocuments());

// 2 - Find the first document


myColl.find().limit(1).forEach(printDocuments());

//--- Close connection ---


mongoClient.close();
}
}

Neo4J
Neo4j is a native graph database, built from the ground up to leverage not only data but also data
relationships.
Unlike traditional databases, which arrange data in rows, columns and tables, Neo4j has a flexible
structure defined by stored relationships between data records.
Each data record, or node, stores direct pointers to all the nodes it’s connected to.
Neo4j's design allows to perform queries with complex connections orders of magnitude faster,
and with more depth, than other databases.
Also, relationships are stored like entities.
MySQL does not handle well multiple joins.
In MySQL we need a lot of joins to reach some information, while here we need to follow edges.
The relationship of following stores in the node the direct address of the node we want to reach,
and there we will have the address of the previous node.
We can traverse even in reverse order.
A Neo4J graph is the typical graph composed by:
- Nodes
- Labels
- Relationships
- Properties
- Indexes
- Constraints
A node represents an object in the graph. It stores data similarly to rows in DBMS or
document/item in NoSQL.

A node can:
- have associated properties
(schema-free)
- connect with other objects
through a relationship (with a
direction)
- be labeled
- be indexed
Labels are used to shape the domain by grouping nodes into sets where all nodes that have a
certain label belong to the same set.
- A node can have from zero to many labels.
- Can be added and removed dynamically
- Conventionally expressed in CamelCase

A relationship connects two nodes.


Relationships organize nodes into structures, allowing a graph to resemble a list, a tree, a map, or
a compound entity.
A relationship must have exactly one relationship type. Typically expressed in UPPER CASE.
- It can have associated properties
- Can be added and removed dynamically
- A node can have relationships to itself.
Temporary relationships can also be used.

Properties are name-value pairs that are used to add qualities to nodes and
relationships.
Property types comprise:
• Number, an abstract type, which has the subtypes Integer and Float
• String
• Boolean
• The spatial type Point
• Temporal types: Date, Time, LocalTime, DateTime, LocalDateTime and
Duration
They typically contain small information because it is stored in memory.
We can have subsets of information in the Graph DB and a subset on the
Document DB even with some redundancy, if it’s needed.

Traversing a graph means visiting nodes by following relationships according to


some rules. In MySQL it would be a join operation.
The traversal result could be returned as a path.

Indexes can speed up queries.


The main reason for using indexes in a graph database is to find the starting point of a graph
traversal. Once that starting point is found, the traversal relies on in-graph structures to achieve
high performance.
An index can be:
- Single-property: the index refers to a single property for a given label. It can match ranges.
- Composite: the index refers to multiple properties for a given label. It can match only by
equality.

Constraints are used to make sure that the data adheres to the rules of the domain.
We can create constraints to:
- Enforce uniqueness (e.g. each Person node is unique)
- Enforce existence of a property in a node (e.g. each person must have the property name
defined)

Supernodes are nodes having a huge number of relationships (hundreds of thousands)


They are problematic because they considerably slow down graph traversal when all relationships
are to be traversed.
For example, given a social media graph, a celebrity node is a super node.
We can consider the edges from 2010-2020 in a node, 2000-2010 in another and this is a smart
way to divide a supernode.
Relationships are stored within a node and this can be a memory problem.

Neo4J: Introduction to Cypher

Neo4J and Cypher are still understandable by non-developers.


The Data Integration let Neo4J integrate with other databases
Neo4J is a native graph database, it doesn’t use indexes in relations.
We have the Index Free Adjacency (IFA).
If we have two nodes, if one wants to visit the other with IFA it can point directly where the other
node is.
We hop from one point to another with a redirection.
In Relational DB we have the problem that Bob doesn’t know where Ann lives.

In a Relational DB links between table are provided using foreign keys, and possibly a lookup table
(the central index in the analogy).
Traversing a foreign key requires an index lookup.
The operation requires the same time if we repeat the query.
The purpose of graphs is to do rapid traversal. The RDBMS model is too expensive for that.

Typically, in a RDBMS the index is stored in a B+ Tree, so the complexity for searching is O(logN).

The index is optimized but still expensive.


In IFA we use pointers to get the target address, logical address and not physical in the RAM.

I also have links in the reverse direction.

- No index lookups or table scans


- Reduced duplication of foreign keys

There’s one issue in deleting nodes because they’re stored in memory.


If we delete a node, we may not be able to reuse the freed space.

How to avoid wasted space on deletion and insertion?


One possible solution is dividing nodes in chunks.
All data objects are chunked into same-size pieces.

In the following we show an example of a node:

Nodes are the name for data records (entity) in a graph.


Data is stored as Properties.
Properties are simple name/value pairs.
Nodes can be grouped together by applying a Label to each member.
A node can have zero or more labels. Labels do not have any properties.
Similar nodes can have different properties.
Properties can be strings, numbers, or Booleans.

Relationships always have direction.


Relationships always have a type and form patterns of data.
Properties on relationships: store information shared by two nodes.

Cypher is a declarative query language that focuses on what to get, not how to retrieve it.
Uses keywords such as MATCH, WHERE, CREATE.
Cypher queries run on the DB server where the graph is loaded into the memory.
The server must have enough memory to perform the query.

In the following, we show an example for creating a node in a small social graph:

• CREATE clause to create data


• () parenthesis to indicate a node
• ee:Person is made of a variable 'ee' and a label 'Person’, ee can be seen as the variable name
We can specify another label using ‘:’ again.
For example, ee:Person:Assistant.
• brackets to add properties to the node

The following statement retrieves a node:

• MATCH clause to specify a pattern of nodes and relationships


• (ee:Person) a single node pattern with label 'Person' which will assign matches to the variable
'ee’
• WHERE clause to constrain the results
• ee.name = "Emil" compares name property to the value "Emil”
• RETURN clause used to request particular results
We need a variable name because we specify conditions.
We specify the label and the where condition on name.
The match can be specified also in this way:

As shown in the following, CREATE clauses can create many nodes and relationships at once:

We attach to Emil the new graph we are going to create.


The first 4 elements create nodes, while the others, create the edges.
When we create relationships between existing nodes, we first have to match the two endpoints
and then create a relation.

match(n) return (n); returns all nodes.


We may add a node to the graph as follows:

We may add a relationship between two existing nodes as follows:

If we create the same node two times we will have a duplicate.


If we want to avoid this, we can use the Merge operation.

The database won’t change because there is already a node with that name.
If we merge for a node that doesn’t exist, a new node will be added.
The merge checks the node, and it has to have the same set of label and properties.
The merge can have two conditions, create or match.
If it matches we don’t have any effect, otherwise a new node will be created.
Merge a node and set properties if the node needs to be created:

If this merge will create a new node, we want to put the timestamp in a new attribute.

If this merge will match, we want to add a new attribute.

Pattern Matching
We can match for a pattern:

The node can have any label.


We specify the direction but using – instead of arrows, we will not have directions.

We will not be interested in directions.


The following statement is used for finding Johan’s Friends:

• MATCH clause to describe the pattern from known Nodes to found Nodes
• (ee) starts the pattern with a Person (qualified by WHERE)
• -[:KNOWS]-matches "KNOWS" relationships (in either direction)
• (friends)will be bound to Johan’s friends
Pattern matching can be used to make recommendations. Johan is learning to surf, so he may
want to find a new friend who already does:

• () empty parenthesis to ignore these nodes, we have a node but without the interest on labelling
it
• DISTINCT because more than one path will match the pattern and it can return twice the same
result
• Surfer will contain Allison, a friend of a friend who surfs

In the following we show an example to set or update a property for a specific node:

We first have to search for that node/relationship, and once matched we can set the property to a
new value.
The setting may depend on to the result of a filtering:

We set the property to the new value only if the condition is satisfied.
The following statement shows how to set or modify the value of properties on relationships:

returning the path will return the complete path.


We put the name of the relationship because we want to change it.
The following example shows how to add a Label to a specific node:

To remove a property, we can set the specific property to NULL.

The example above removes the property worksIn from all the nodes with label Person.
To delete a relationship of a node we first find the endpoints:

To delete all relationships of a node:

To delete a node:

It will return an error if there are still relationships between these nodes, it works only if nodes are
isolated. To avoid it we have to use detach. To delete all nodes and relationships:

It will delete relationships and then the node.


Similar to MongoDB we have skip and limit.
If we return values we will have a table and not a graph as output.

In the first query we will have as output also Tom Hanks; If we don’t want to have it we need to
specify: WHERE tom <> coactors.
We want no more than 4 hops distance, we have a range from 1 to 4. We can avoid specifying the
pattern.
We can have any number of hops in the 3rd query. p takes the shortest path.
strength: cocoactors more connected to Tom Hanks, how many ways to reach it (having in
common movies or cofactors).

Neo4J: Java Driver


Neo4j provides official drivers for many programming languages.
• These drivers are supported by Neo4j and come with full cluster routing support.
• Community drivers also exist for many languages, but vary greatly in terms of feature sets,
maturity, and support.
The driver API is intended to be topologically agnostic.
• This means that the underlying database topology — single instance, Causal Cluster, etc. — can
be altered without requiring a corresponding alteration to application code.
We access the driver as a whole and the driver will manage the single instance to access.
• In the general case, only the connection URI needs to be modified when changes are made to
the topology.
• The official drivers do not support HTTP communication.
• If we need an HTTP driver, choose one of the community drivers.
When using Maven, add the following block to your pom.xml file:

A Neo4j client application will require a driver instance in order to provide access to the database.
• This driver should be made available to all parts of the application that need to interact with
Neo4j.
• The driver can be considered thread-safe, we don’t have to lock resources to avoid deadlocks or
other problems.
• Applications will typically construct a driver instance on startup and destroy it on exit.
Every time we open the driver instance, we must also close it.
• Destroying a driver instance will immediately shut down any connections previously opened via
that driver, and temporary results will be released.
If we have a driver object open but no connection open, then we are not consuming resources.
To construct a driver instance, a connection URI and authentication information must be supplied.
The default username is neo4j and the password is what we defined.
• Additional configuration details can be supplied if required.
• All of these details are immutable for the lifetime of the driver.
• Therefore, if multiple configurations are required (such as when working with multiple database
users) then multiple driver instances must be used.
• When we work with a single server database, we can create a direct driver via bolt URI, for
example: bolt://localhost:7687.
For a cluster, please use the neo4j URI neo4j://localhost:7687
Because if we use bolt, all requests will be redirected to the bolt server we specified.
Authentication details are provided as an auth token which contains the usernames, passwords or
other credentials required to access the database.
Neo4j supports multiple authentication standards but uses basic authentication by default.

It’s also possible to support encryption.


We pass to the driver a URI and an authentication token.
The class implements AutoCloseable and it will automatically close the resources we specified in
this class.
The driver maintains a pool of connections.
• The pooled connections are reused by sessions and transactions to avoid the overhead added by
establishing new connections for every query.
The connections are limited in number and they’re allocated/deallocated dynamically by the
driver.
• The connection pool always starts up empty. New connections are created on demand by
sessions and transactions. We don’t have to specify the dimension.
• When a session or a transaction is done with its execution, the connection will be returned to
the pool to be reused.
• Application users can tune connection pool settings to configure the driver for different use
cases based on client performance requirements and database resource consumption limits.
If a transactions fails, will try to reapply the same transaction.
Moreover, also the connection timeout and the max retry time can be configured.

This is the architecture of the session.

A session works within a session context and within the session context we have different
transactions.
Within session transactions are performed according the CAUSALITY constraint.
This is also guaranteed within a cluster.
Bookmarking system allows Neo4J to recognize which members of a cluster can execute a
transaction (only up-to-date members).
If we bookmark a transaction, Neo4J it will look for all member up-to-date to this transaction.
The routing layer will search for an available member to performing the transaction respecting the
casuality constraints (for example a read after a write).
We just specify what transaction we want, and it will be handled by the driver.
Sequential transactions can also be executed by different members of the cluster but in a casuality
manner.

A session is a lightweight container for a sequence of transactions.


• Sessions borrow connections from a pool as required and so should be considered lightweight
and disposable.
• Sessions are usually scoped within a context block. This because we can refer to results in a
scope of a session. When a session is closed, all records read so far are closed.
• This ensures that they are properly closed and that any underlying connections are released and
not leaked.

Transactions are atomic units of work consisting of one or more Cypher statement executions.
We execute all statements or none.
A transaction is executed within a session.
• To execute a Cypher statement, two pieces of information are required: the statement template,
a template string with inside bookmarks, and a keyed set of parameters that will be replaced
inside a string.
• The template is a string containing placeholders that are substituted with parameter values at
runtime.
• The Neo4j driver API provides for three forms of transaction:
- Auto-commit transactions
This is auto-commit, we perform a transaction directly into a session.
If we have just a single-line transaction in a session, it can be performed directly to the
session and it wil auto-commit.
If it fails, it will not retry to apply the command
- Transaction functions
The first time it fails, the system will try to re-execute it until the maximum number of tries
and there’s also a timeout.
- Explicit transactions
Of these, only transaction function can be automatically replayed on failure.

An auto-commit transaction is a simple but limited form of transaction.


Such a transaction consists of only one Cypher statement, cannot be automatically replayed on
failure, and cannot take part in a causal chain (if we have a set of transaction casually correlated
they cannot be put in a chain).
Auto-commit transactions are sent to the network and acknowledged immediately.
They receive, after the completion, the acknowledgment of the completion.
This means that these transactions exhibit a lesser efficiency than other forms of transaction.
They are not preferred within all transactions.

Transaction functions are the recommended form for containing transactional units of work.
Transaction functions are also able to handle connection problems and transient errors using an
automatic retry mechanism.
The retry capability can be configured on Driver construction.
We have a session, and we don’t apply session.run() but we call the writeTransaction() that needs
an anonymous inner class and inside execute we put the code of the transaction.
We need to specify the return value of the transaction in the function we return.
We can assign the writeTransaction to an integer in this case, if we specify a list of string or void
inside <> we will have a different return result.
The anonymous inner does not have a name.

Considering access modes, transactions can be executed in either read or write mode.
In a Causal Cluster, each transaction will be routed by the driver to an appropriate server based on
the mode.
When using a single instance, all transactions will be passed to that one server.

Access modes can be supplied in two ways: per transaction or per session.
In the general case, access mode should always be specified at transaction level, using transaction
functions (writeTransaction or readdTransaction).
The session-level setting is only necessary for explicit and auto-commit transactions.
Note that the driver does not parse Cypher and cannot determine whether a transaction is
intended to carry out read or write operations.
The driver does not see an operation as write operation seeing that it is a CREATE or something
else, we need to specify explicit transaction functions.
As a result of this, a write transaction tagged for read will be sent to a read server but will fail on
execution.
Here we have both write/read transaction within a session.
*StatementResult = Result
We get the first value with get(0). We can specify the position or the name of the column we want
to get, we need to convert it to the correct type.

Cypher
These are the Cypher types that always have the corresponding Java types.

The float of Cypher is a double in java for example.


A Result object is a list of records. The result is typically handled by the receiving application as it
arrives, although it is also possible to retain all or part of a result for future
consumption.

A record provides an immutable view of part of a result.


It is an ordered map of keys and values.
As values within a record are both keyed and ordered, that value can be
accessed either by position (0-based integer) or by key (string).

Query results will often be consumed as a stream.


Drivers provide a language-idiomatic way to iterate through a result
stream.
We want to retry the list of movie titles and we perform a read transaction.
We can see the template of the query and we can see $name the parameter of the query, that will
be dynamically substituted by the actorName.
The query will be the same, and dynamically we can specify the correct parameter.
We can get the iterator or use hasNext()/next().
movieTitles will contain all movie titles for a particular actor.
We can use the results within the transaction or retain the results and use them further.
The result record stream is available until another statement is run in the session, or until the
current transaction is closed. To hold on to the results beyond this scope, the results need to be
explicitly retained.

package main.java;

import org.neo4j.driver.*;
import org.neo4j.driver.types.Node;
import org.neo4j.driver.types.Path;
import java.util.ArrayList;
import java.util.List;
import static org.neo4j.driver.Values.parameters;

public class NeoJavaLesson implements AutoCloseable{


private final Driver driver;
public NeoJavaLesson( String uri, String user, String password ){
driver = GraphDatabase.driver( uri, AuthTokens.basic( user, password ) );
}

@Override // because we are implementing AutoCloseable


public void close() throws Exception{
driver.close();
}

public static void main( String... args ) throws Exception{


try ( NeoJavaLesson neo4j = new NeoJavaLesson( "neo4j://localhost:7687", "neo4j", "root" ) ){
neo4j.addPerson( "Gionatan", "Italy", 27 );
neo4j.printAge( "Gionatan");
neo4j.printActorMovies( "Tom Hanks");
neo4j.printCoActors( "Tom Hanks");
neo4j.printMostFamousActor();
neo4j.printActorsWithAtLeastNMovies(5);
neo4j.printMoviesWithVeryOldDirectors();
neo4j.printShortestPathWithIntermediaryActor("Tom Hanks",
"Kevin Bacon", "Meg Ryan");
}
}

public void addPerson( final String name, final String from, final int age )
{
try ( Session session = driver.session() ){
session.writeTransaction((TransactionWork<Void>) tx -> { //Void because we don’t want to
return anything
tx.run( "MERGE (p:Person {name: $name, from: $from, age: $age})",
parameters( "name", name, "from", from, "age", age ) );
// MERGE because if it already exists, CREATE will return an exception
// we always have to execute write operations that are idempotent
// we also have to handle exceptions
return null;
});
}
}

private void printAge(String name){


try ( Session session = driver.session() ){
Integer age = session.readTransaction((TransactionWork<Integer>) tx -> { //return an
integer
Result result = tx.run( "MATCH (p:Person) WHERE p.name = $name RETURN p.age",
parameters( "name", name) );
return result.single().get(0).asInt(); //return the first element, we can also use “p.age”
because we didn’t renamed it, if we return p.age AS age we have to use age
});
System.out.println(age);
}
}

//Find and show all the movie titles an actor acted in


private void printActorMovies( final String actorName){
try ( Session session = driver.session() ){
List<String> movieTitles = session.readTransaction((TransactionWork<List<String>>) tx -> {
Result result = tx.run( "MATCH (p:Person)-[:ACTED_IN]->(m) WHERE p.name = $name" +
" RETURN m.title as Title",
parameters( "name", actorName) );
// We want to return a list of strings
ArrayList<String> movies = new ArrayList<>();
while(result.hasNext())
{
Record r = result.next();
movies.add(r.get("Title").asString()); // we will add the title in the list of strings
}
return movies;
});
System.out.println(movieTitles);
}
}

//Find and show all the co-actors of an actor


private void printCoActors( final String actorName){
try ( Session session = driver.session()){
List<String> coActorsNames = session.readTransaction((TransactionWork<List<String>>) tx ->
{
Result result = tx.run( "MATCH (p:Person)-[:ACTED_IN]->(m)<-[:ACTED_IN]-(others) " +
"WHERE p.name = $name " +
"RETURN others.name as Coactors",
parameters( "name", actorName) );
ArrayList<String> coActors = new ArrayList<>();
while(result.hasNext())
{
Record r = result.next();
coActors.add(r.get("Coactors").asString());
}
return coActors;
});
System.out.println(coActorsNames);
}
}

//Find the actor with the maximum number of movies who acted in
private void printMostFamousActor()
{
try ( Session session = driver.session() )
{
String mostFamousActor = session.readTransaction((TransactionWork<String>) tx -> {
String query = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) " +
"RETURN p.name AS Name, count(*) as NumMovies " +
"ORDER BY NumMovies DESC " +
"LIMIT 1";
Result result = tx.run( query );
return result.single().get("Name").asString();
});
System.out.println(mostFamousActor);
}
}

//Find actors with at least n movies


private void printActorsWithAtLeastNMovies(final int n)
{
try ( Session session = driver.session() )
{
session.readTransaction((TransactionWork<Void>) tx -> {
String query = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) " +
"WITH p.name AS Name, count(*) AS NumMovies " +
"WHERE NumMovies >= $numMovies " +
"RETURN Name, NumMovies";
Result result = tx.run(query, parameters("numMovies", n));
while (result.hasNext()) {
Record r = result.next();
String name = r.get("Name").asString();
int nMovies = r.get("NumMovies").asInt();
System.out.println(name + " with " + nMovies + " movies");
}
return null;
});
}
}
//Find all the movies where the avg age of actors is lower than the youngest director by 20 years
private void printMoviesWithVeryOldDirectors(){
try ( Session session = driver.session() ){
// The WITH statement is useful because after return we cannot put another match without
closing the statement, if we want to keep the same statement we have to use WITH
session.readTransaction((TransactionWork<Void>) tx -> {
String query = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) " +
"WITH m.title AS Title, avg(m.released - p.born) AS AvgActorsAge " +
"MATCH (p2:Person)-[:DIRECTED]->(m2:Movie) WHERE m2.title = Title " +
"WITH m2.title AS Title, min(m2.released - p2.born) AS YoungestDirectorAge,
AvgActorsAge " +
"WHERE AvgActorsAge < YoungestDirectorAge - 20 " +
"RETURN Title, YoungestDirectorAge, AvgActorsAge";
Result result = tx.run(query);
List<Record> records = result.list(); //return a list of records
for (Record r: records) {
String title = r.get("Title").asString();
int youngestDirectorAge = r.get("YoungestDirectorAge").asInt();
double avgActorsAge = r.get("AvgActorsAge").asDouble();
System.out.println("\"" + title + "\". Youngest director is " + youngestDirectorAge + "
years old" + ", and the average actors' age is " + avgActorsAge);
}
return null;
});
}
}
//Find and show the shortest path between two actor which passes through a third actor
private void printShortestPathWithIntermediaryActor(final String firstActor, final String
lastActor,
final String middleActor){
try ( Session session = driver.session() ){
session.readTransaction((TransactionWork<Void>) tx -> {
String query = "MATCH path = shortestPath((:Person {name: $name1})-[*]-(:Person
{name: $name2})) " +
"WHERE ANY(x IN nodes(path) WHERE x.name = $name3) " +
"RETURN path";
// ANY will be true if any of the statements specified are true, we can also use NONE or ALL
or SINGLE (just one), for example

// If there is at least one member of the list to satisfy this condition, this will be true.
Result result = tx.run( query,
parameters("name1", firstActor, "name2", lastActor, "name3", middleActor));
// We can get the path (neo4j type)
Path path = result.single().get("path").asPath();
String pathString = "--";
for(Node n : path.nodes())
{
if(n.hasLabel("Person"))
{
pathString += n.get("name").asString() + "--";
}
else if(n.hasLabel("Movie"))
{
pathString += n.get("title").asString() + "--";
}
}
System.out.println("The length of the path is " + path.length());
System.out.println(pathString);
return null;
});
}
}
}

1. Implement the following queries in Neo4J Java, using the movies graph database
a) Find a title which contains a certain string
b) Find a movie by title and display the cast
c) Find the director that managed the least number of distinct actors
d) Find the top 3 movies with the highest ratio between number of actors and number of
producers
e) Find the shortest path from the youngest actor and the oldest director

package main.java;
import org.neo4j.driver.*;
import java.util.ArrayList;
import java.util.List;
import static org.neo4j.driver.Values.parameters;

public class NeoJavaEx implements AutoCloseable{


private final Driver driver;
public NeoJavaEx(String uri, String user, String password )
{
driver = GraphDatabase.driver( uri, AuthTokens.basic( user, password ) );
}

@Override
public void close() throws Exception
{
driver.close();
}

//Search function: find a title that contain a certain string


private void printMovieByPartOfTitle( final String partOfTitle)
{
try ( Session session = driver.session() )
{
List<String> movieTitles = session.readTransaction((TransactionWork<List<String>>) tx -> {
Result result = tx.run( "MATCH (m:Movie) WHERE toLower(m.title) CONTAINS
$partOfTitle" + " RETURN m.title as Title", parameters( "partOfTitle", partOfTitle) );
Arr ayList<String> movies = new ArrayList<>();
while(result.hasNext())
{
Record r = result.next();
movies.add(r.get("Title").asString());
}
return movies;
});
System.out.println("Movies that contain the substring '" + partOfTitle + "' are:");
for (String movieTitle: movieTitles)
{
System.out.println("\t- " + movieTitle);
}
}
}
//Find a movie by title and display the cast
private void printMovieCast( final String title)
{
try ( Session session = driver.session() )
{
session.readTransaction((TransactionWork<Void>) tx -> {
Result result = tx.run( "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: $title}) " +
"RETURN m.title as Title, collect(p.name) as Cast",
parameters( "title", title) );
List<Object> cast = result.single().get("Cast").asList();
System.out.println("The cast of \"" + title + "\" is composed by:");
int i = 1;
for(Object member: cast)
{
System.out.println("\t" + i + ") " + member);
//we have to cast them and print them as String
i++;
}

return null;
});

}
}

//Find the director that managed the least number of distinct actors
private void printLeastPopularDirector()
{
try ( Session session = driver.session() )
{
session.readTransaction((TransactionWork<Void>) tx -> {
//If we use different variable names, we are stating the people are different
String query = "MATCH (p:Person)-[:DIRECTED]->(m)<-[:ACTED_IN]-(p2:Person)\n" +
"RETURN p.name AS Name, count(DISTINCT p2.name) AS numDistinctActors\n" +
"ORDER BY numDistinctActors\n" +
"LIMIT 1";
Result result = tx.run(query);
while(result.hasNext())
{
Record r = result.next();
String name = r.get("Name").asString();
int nActors = r.get("numDistinctActors").asInt();
System.out.println("The least popular director is " + name + ".");
System.out.println(name + " worked with only " + nActors + " actors");
}
return null;
});
}
}

//Find the top 3 movies with the highest ratio between number of producers and number of
actors
private void printTop3MoviesWithHighestActorsProducersRatio()
{
try ( Session session = driver.session() )
{
// we have to take care of duplicates
// for each movie we get the number of actors, then we want to match the movie produced
by a person, we need to use the where or we will have the cartesian product, we filter the match
by the previous match
session.readTransaction((TransactionWork<Void>) tx -> {
String query = "MATCH (m:Movie)-[:ACTED_IN]-(p:Person)\n" +
"WITH m.title AS Title, count(*) AS nActors\n" +
"MATCH (m2:Movie)-[:PRODUCED]-(p2:Person) WHERE m2.title = Title\n" +
"RETURN m2.title AS Title, toFloat(nActors / count(*)) AS Ratio, nActors\n" +
"ORDER BY Ratio DESC\n" +
"LIMIT 3";
Result result = tx.run(query);
while(result.hasNext())
{
Record r = result.next();
String title = r.get("Title").asString();
double ratio = r.get("Ratio").asDouble();
System.out.println("\"" + title + "\" with a ratio of " + ratio);
}
return null;
});
}
}

//Find the shortest path from the youngest actor and the oldest director
private void printShortestPathFromYoungestToOldest()
{
try ( Session session = driver.session() )
{
String youngestActor = session.readTransaction((TransactionWork<String>) tx -> {
//Find the youngest actor
String query = "MATCH (m:Movie)-[:ACTED_IN]-(p:Person) " +
"WHERE EXISTS(p.born) " + //if we don’t put it it will return actors with born = null,
                                                                     and for Neo4j null < any integer
"RETURN p.name as Name, p.born as Birth " +
"ORDER BY Birth DESC " +
"LIMIT 1";
Result result = tx.run(query);
Record r = result.next();
String name = r.get("Name").asString();
int birth = r.get("Birth").asInt();
System.out.println("The youngest actor is " + name + " who is born in " + birth);
return name;
});

String oldestDirector = session.readTransaction((TransactionWork<String>) tx -> {


//Find the oldest director
String query = "MATCH (m:Movie)-[:DIRECTED]-(p:Person) " +
"WHERE EXISTS(p.born) " +
"RETURN p.name as Name, p.born as Birth " +
"ORDER BY Birth ASC " +
"LIMIT 1";
Result result = tx.run(query);
Record r = result.next();
String name = r.get("Name").asString();
int birth = r.get("Birth").asInt();
System.out.println("The oldest director is " + name + " who is born in " + birth);
return name;
});

int l = session.readTransaction((TransactionWork<Integer>) tx -> {


//Find the shortest path between the youngest actor and the oldest director
String query = "MATCH path = shortestPath((p:Person {name: $name1})-[*]-(p2:Person
{name: $name2})) " + "RETURN length(path) AS PathLength";
Result result = tx.run(query, parameters("name1", youngestActor, "name2",
oldestDirector));
return result.single().get("PathLength").asInt();
});
System.out.println("The shortest path from " + youngestActor + " to " + oldestDirector + " is
" + l + " hops long");
}
}

public static void main( String... args ) throws Exception{


try ( NeoJavaEx neo4j = new NeoJavaEx( "neo4j://localhost:7687", "neo4j", "root" ) ){
neo4j.printMovieByPartOfTitle( "the");
neo4j.printMovieCast( "The Matrix");
neo4j.printLeastPopularDirector();
neo4j.printTop3MoviesWithHighestActorsProducersRatio();
neo4j.printShortestPathFromYoungestToOldest();
}
}

You might also like