BY
Deepak Mali
1
RIGI PUBLICATION
All right reserved
No part of this book may be reproduced in any form, by
Photostat, Microfilm, xerography, or any other means or
incorporated into any information retrieval system, Electronic
or Mechanical, Without the Written Permission of the
copyright owner.
Versatile Java
By
Deepak Mali
Copyright Deepak Mali 2016
Originally published in India
Published by RIGI PUBLICATION
777, Street no.9, Krishna Nagar
Khanna-141401 (Punjab), India
Website: www.rigipublication.com
Email: info@rigipublication.com
Phone: +91-9357710014, +91-9465468291
2
PREFACE
Working in the technology field for around a decade, I gained
significant experience working with market leaders such as
IBM, SAP, and Oracle. Recently I started to work in the
Investment Banking Domain managing large scale projects and
ensuring a smooth delivery.
Throughout the last decade, I have seen Java evolving like
anything. From the early days of Java 1.4 to the increased focus
on Machine Learning, Big Data Analytics, Functional
Programming, Concurrency, Microservices, Modularity,
Cryptography, and the list goes on.
I have made an attempt to accumulate my learnings in the
diverse or rather versatile Java Technology areas at one place.
I hope this will be a ready reference for professionals working
on these technologies.
This book wouldnt have seen the light of the day without the
blessings of my parents Prof. S.L. Mali and Mrs. Sita and a
constant support and encouragement of my wife Mrs. Shikha
who is herself a computer enthusiast and helped me a lot with
the proofreading and corrections to make the book better. My
sisters Dr. Sunita, Er. Vinita and Dr. Chetna have been my
constant source of inspiration throughout the journey.
I also pay my warm regards to Rigi Publication for their
cooperation in bringing up this book at a fast pace.
3
I will be happy to learn about any errors which could have crept
in by mistake and suggestions for further improvements of the
book .My email ID is ssm.deepak@gmail.com
Thanks and happy reading...
4
TABLE OF CONTENTS
1. Functional Programming using lambda expressions 6
2. Concurrency in Java 22
3. Network Programming using java 52
4. Microservices Using Spring Cloud 95
5. Big Data Technology (Hadoop) 128
6. Crypto- Algorithms using Java 164
7. Modularity in Java 185
8. Machine Learning using Spark 213
9. Genetic Algorithms using Java 282
10. Natural Language Processing 325
11. Performance Enhancements in Java 340
5
CHAPTER 1 - FUNCTIONAL PROGRAMMING
USING LAMBDA EXPRESSIONS
Getting Started
From the inception of Java to recent times, there were thoughts
around the functional programming aspect of Java .This has
become more evident in the recent times with the introduction
of lambda expressions .Functional Programming helps us
achieve an abstraction over the behavior. Needless to say, the
idea brings many goodies in the bag in form of writing logic
which is more readable to get the business intent behind the
program, rather than writing tons of boiler plate code to define
the mechanics behind the same. Functional Programming is
more about writing clean code to solve a specific business
domain problem statement in terms of immutable values and
functions.
To get a flavor of Lambda expressions lets take an example of
a Class implementing Runnable interface in the below snippet.
classTest_Thread implements Runnable {
@Override
public void run() {
// TODO Auto-generated method stub
System.out.println("Hello There");
6
A thread invoking the above Runnable target can be spawned
using the below line of Program.
Thread t = new Thread (new Test_Thread());
t.start();
Console Output is: Hello There
This is an example where we try to send code as data .However;
we have ended up writing a lot of boiler plate code.
Additionally, the intent here was to pass on the behavior which
the thread would execute.A much more concise and intent
conveying way would be one where we could pass on behavior
in a precise way Here Lambda Expressions come to our
rescue!
The above program could be modified as below:
Thread t = new Thread (() ->System.out.println ("Hello
Lambda Runnable"));
t.start();
Console Output is: Hello Lambda Runnable
Highlighted is the Lambda expression implementing Runnable
.It takes no arguments and its return type is void. Here, the
implementation can also be a block of code as per the below:
Thread t = new Thread(() -> {
System.out.println("Hello Lambda");
System.out.println("Runnable");
});
7
Console Output is:
Hello Lambda
Runnable
Lets take another example of a Swing based program to
understand the concept further. In the below code snippet, an
event listener is registered on a button.
button.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent event) {
System.out.println("button clicked");
});
Here, the anonymous inner class, Action Listener provides
implementation of the single methodof interface,
actionPerformed. Instead of passing the object implementing
the interface ActionListener, we could pass the function to the
event listener in the form of Lambda Expression.
button.addActionListener(event ->System.out.println("button
clicked"));
The event is the same parameter as passed to the
actionPerformed method and is separated by the body of the
expression the resultant of the event trigger which is to print
button clicked on the console.
Lambda expressions can also be used to represent methods .For
example - BinaryOperator is a functional interface (we will look
8
into functional interfaces later) defined to represent an add
method.
BinaryOperator<Integer> add = (x, y) -> x + y;
We will explore later that functional interfaces have functional
methods .For Binary Operator, the functional method is
apply(Object, Object).What this means is ,in case we need to
add two integers 4 and 5 , we can invoke the add operation here
by writing add.apply (4,5).
The type of the arguments is inferred by the compiler in the
above case, however at times we need to explicitly define the
types as below:
BinaryOperator<Long> add = (Long u,Long v )->u+v;
final or effectively final variables
Variables used in Lambda Expressions should either be marked
explicitly final or they are effectively final. Thus, in case we
assign a variable from a POJO getter call and utilize the
variable in the lambda expression we have written earlier, the
value assigned to the variable gets printed to the console
.However, in case we try reassigning the variable name
between lines (1) and (2) , the compiler will not allow saying
Local variable name defined in an enclosing scope must be final or
effectively final
String name = getName(); (1)
button.addActionListener(event->System.out.println(name));
(2)
Here, the variable, although not declared final explicitly still
behaves as final or in other words it is effectively final
9
Since the unassigned variables are in a way closed by the
lambda expressions over the surrounding state, they are at times
referred to as closures as well.
Functional Interfaces
Functional interfaces are the interfaces with a single (abstract)
method and we can create an instance of a functional interface
using lambda expressions. It does not matter what is the exact
name of the method, it will be matched with the lambda
expression as long as the signature is compatible. Functional
interfaces can act as assignment targets for lambda expressions.
The below table lists the more frequently used functional
interfaces bundled in JDK.
(T - the type of the input to the function R - the type of the
result of the function)
Functional Interface Functional method
@FunctionalInterface accept(Object)
public interface Consumer<T>
It accepts a single input
argument and returns no result
@FunctionalInterface test(Object)
public interface Predicate<T>
Represents a predicate
(boolean-valued function) of
one argument.
10
@FunctionalInterface apply(Object)
public interface
Function<T,R>
Represents a function that
accepts one argument and
produces a result
@FunctionalInterface get()
public interface Supplier<T>
Represents a supplier of
results
@FunctionalInterface Function.apply(Object)
public interface
UnaryOperator<T>
extends Function<T,T>
Represents an operation on a
single operand that produces a
result of the same type as its
operand.
@FunctionalInterface BiFunction.apply(Object,
Object)
public interface
BinaryOperator<T>
extends BiFunction<T,T,T>
Represents an operation upon
two operands of the same
11
type, producing a result of the
same type as the operands
Type inference in lambdas
Lets take an example of Predicate functional interface which
checks if the expression is true or not.
Predicate<Integer> isGreaterThan10 = x -> x > 10;
The Predicate has a single generic type Integer, thus the
variable used in the body of lambda is inferred by the compiler
to be an Integer. Here the base of compiler inferring the type is
the generic to which the Predicate is tied to .In case we omit this
information, the compiler will have tough time indentifying the
types .The compiler will consider the operands to be of the base
type i.e. java.lang.Object , resulting in an error message The
operator + is undefined for the argument type(s) java.lang.Object,
java.lang.Object. The below snippet demonstrates the compiler
error situation.
BinaryOperator add = (x,y)->x+y;
Internal and external iteration usage of streams
Until now, the way we used to iterate over the collection is to
make use of the iterator method .In case there are many
occurrences of iterating over the collection, not only we end up
writing a lot of boiler plate code but the way it executes is serial
in nature. A snippet for external iteration is shown below where
we try to iterate over the ArrayList of trades .For simplicity,
each trade has two attributes trade reference and trade region.
The intent of the below code is to get the count of all the trades
which have region as EMEA.
12
ArrayList<Trade> trades = new ArrayList<Trade>();
trades.add(new Trade(100, "EMEA"));
trades.add(new Trade(101, "Americas"));
trades.add(new Trade(102, "APAC"));
trades.add(new Trade(103, "EMEA"));
trades.add(new Trade(104, "EMEA"));
trades.add(new Trade(105, "APAC"));
trades.add(new Trade(106, "EMEA"));
long count = 0;
String trade_region = null;
Iterator<Trade>trades_iterator = trades.iterator();
while (trades_iterator.hasNext()) {
trade_region = trades_iterator.next().getTrade_region();
if (trade_region.equals("EMEA"))
count++;
When we print the count on the console, it should give us 4 as
there are 4 trades with EMEA as the region.As you have
correctly noticed, its also tough to predict the behavioral
operation looking at the code.A better and more efficient way of
achieving the same objective would be to use streams. Streams
are used for collection processing adhering to the functional
13
programming paradigm.Thus, the above code can be modified
as below
long l = trades.stream().filter(trade -
>trade.isTradeRegion("EMEA"))
.count();
Here, the filter method returns us the filtered stream. Count
counts the number of elements in the filtered stream. Another
notable difference is that the filter method does not return a new
value at the end it just returns a filtered stream sequence .Such
methods are called as lazy method .But the count method does
return as final value and hence termed as eager methods. The
preferred way is to have a chain of lazy operations with an
eager operation at the end of the chain. This reduces the number
of times we have to iterate over the collection.
A few of uses of eager and terminal operations are given below:
collect(toList())
List<String>trades_collected=Stream.of("Trade1","Trade2","Tr
ade3"). collect(Collectors.toList());
The of() is the factory method to return the sequential stream
and collect performs a mutable reduction using Collector.
Collectors is the implementation of Collector utilities.
map
map is applied over a stream sequence to transform give stream
to a stream containing new values .Since there is an input to the
stream and a result from the stream, the lambda expression used
here is an instance of Function functional interface .
14
List<String>trades_mapped=Stream.of("trade1","trade2","trade
3").map(String->String.toUpperCase()).collect(Collectors.toList());
The mapped stream is [TRADE1, TRADE2, TRADE3].
Additionally, flatMaplets us replace value with stream and
combines the streams together. max and min are used to find the
max and in element of the stream using comparing method of
Comparator.
reduce operation is used when we want a collection of values
upon we need to perform an operation defined by the lambda
expression yielding the final result.
At the end, the stack_up variable contains the result.
intreduce_operation_value = Stream.of(1, 2, 3)
.reduce(0, (stack_up, element) ->stack_up + element);
Console Output: 6
As a thump rule, always write the stream based code in a
chained fashion to avoid creating intermediate collection
objects. This will also help taking the benefits of API to
automatically parallelize the operations. In the above examples,
we have noticed that a function is either used an an argument or
as a return type-such functions are also at times referred as
higher order functions.
Method References
We can make a method reference using
Classname::methodname .We have not used the parenthesis
here since this is the equivalent of the method of the lambda
15
expression .We can also use the abbreviated notation for new
Object in the following way .
TreeSet <String> trades_collected_tree_Set =
Stream.of("xyz","pqr","abc").collect(Collectors.toCollection(Tr
eeSet::new));
trades.stream().collect(Collectors.averagingLong(Trade::getTra
saction_amount));
System.out.println(reduce_operation_value);
Console Output :[abc, pqr, xyz]
Element Ordering
When we create a Stream using a defined order from a
collection , the Stream has a defined encounter order .If there is
no defined order to begin , the Stream produced by that source
has no defined order .Certain operations on the Stream creates
an encounter order like Sort operations .
// encounter order is maintained
List<Integer>numbers=asList(1,2,3,4);
List<Integer>sameOrder=numbers.stream().collect(toList());
// encounter order is different
Set<Integer>numbers=newHashSet<>(asList(4,3,2,1));
List<Integer>sameOrder=numbers.stream()
.collect(toList());
16
// sorting ensures encounter order
Set<Integer> numbers = new HashSet<>(asList(4, 3, 2, 1));
List<Integer> sameOrder = numbers.stream()
.sorted()
.collect(toList());
At times post the Stream operations we want to produce
Collection as a final value .This may be required when we need
to get a TreeSet as a resultant collection .It is also possible to
collect into a single value from the collector .
TreeSet <String> trades_collected_tree_Set =
Stream.of("xyz","pqr","abc").collect(Collectors.toCollection(Tr
eeSet::new));
Console Output :[abc, pqr, xyz]
// Fetching the average transaction amount of the trades
trades.stream().collect(Collectors.averagingLong(Trade::getTra
saction_amount));
We can also group the data which the Stream has given us using
the groupingBy() method .
// the below code group the Stream data based on the Trader.
trades.stream().collect(Collectors.groupingBy(Trade::getTrader
));
We can build up String using Streams using joining() method as
follows.
17
String
result=Stream.of("Trade1","Trade2","Trade3").collect(Collecto
rs.joining(",","[","]"));
Console Output: [Trade1,Trade2,Trade3]
Static methods on Interfaces
Weve seen a lot of calling of Stream.of but havent gotten into
its details yet. You may recall that Stream is an interface, but
this is a static method on an interface. This is another new
language change that has made its way into Java 8, primarily in
order to help library developers, but with benefits for day-to-
day application developers as well.
Optional is a new core library data type that is designed to
provide a better alternative to null.
Below is the snippet demonstrating the usage of Optional
Optional<String> a = Optional.of("a");
System.out.println(a.get());
Optional<String> b =Optional.empty();
System.out.println(b.isPresent());
Optional<String> c = Optional.ofNullable(null);
if(c.isPresent())
System.out.println(c.get());
Console Output : a
false
<No Output>
18
Parallel Stream Operations
Making an operation execute in parallel using the streams
library is a matter of changing a single method call. If you
already have a Stream object, then you can call
its parallel method in order to make it parallel. If youre
creating a Stream from a Collection, you can call
the parallelStream method in order to create a parallel stream
from the get-go.The following Snippet gives us a good idea
about the same .
publicstaticvoid main (String args[])
ArrayList<Trade>list_of_trades = new
ArrayList<Trade>();
for(inti=0;i<=100;i++)
list_of_trades.add(new Trade(1000000+i,
"APAC"));
for(inti=101;i<=200;i++)
list_of_trades.add(new Trade(1000000+i,
"EMEA"));
for(inti=201;i<=300;i++)
list_of_trades.add(new Trade(1000000+i,
"NA"));
for(inti=301;i<=400;i++)
list_of_trades.add(new Trade(1000000+i,
"JAPAN"));
19
ConcurrentMap<Long, List<Trade>>map =
list_of_trades.parallelStream()
.filter(r ->r.getRegion().equals("EMEA"))
.collect(Collectors.groupingByConcurrent
(Trade::getTrade_id));
map.forEach((k,l) -> System.out.println("Trade
ID =" + k + "Trade value :" ));
class Trade
privatelongtrade_id ;
private String region;
public Trade(longtrade_id, String region) {
super();
this.trade_id = trade_id;
this.region = region;
publiclong getTrade_id() {
returntrade_id;
20
publicvoid setTrade_id(longtrade_id) {
this.trade_id = trade_id;
public String getRegion() {
returnregion;
publicvoid setRegion(String region) {
this.region = region;
21
CHAPTER2 CONCURRENCY IN JAVA
Getting Started
Concurrency and parallelism are very similar concepts.
Different authors give different definitions to these concepts.
The most accepted definition talks about concurrency when you
have more than one task in a single processor with a single core
and the operating system's task scheduler quickly switches from
one task to another, so it seems that all the tasks run
simultaneously. The same definition talks about parallelism
when you have more than one task that run simultaneously at
the same time, in a different computer, processor, or core inside
a processor.
Another definition talks about concurrency when you have
more than one task (different tasks) running simultaneously on
your system. One more definition discusses parallelism when
you have different instances of the same task running
simultaneously over different parts of a dataset.
The last definition that we include talks about parallelism when
you have more than one task that runs simultaneously in your
system and talks about concurrency to explain the different
techniques and mechanisms programmers have to synchronize
with the tasks and their access to shared resources.
As you can see, both concepts are very similar and this
similarity has increased with the development of multicore
processors.
22
Synchronization and Immutable Objects
In concurrency, we can define synchronization as the
coordination of two or more tasks to get the desired results. We
have two kinds of synchronization:
Control synchronization: When, for example, one task
depends on the end of another task, the second task can't
start before the first has finished
Data access synchronization: When two or more tasks have
access to a shared variable and only one of the tasks can
access the variable at any given time
A concept closely related to synchronization is critical section.
A critical section is a piece of code that can be only executed by
a task at any given time because of its access to a shared
resource. Mutual exclusion is the mechanism used to guarantee
this requirement and can be implemented by different ways.
Keep in mind that synchronization helps you avoid some errors
you can have with concurrent tasks (they will be described later
in this chapter), but it introduces some overhead to your
algorithm. You have to calculate very carefully the number of
tasks, which can be performed independently without
intercommunication in your parallel algorithm. It's
the granularity of your concurrent algorithm. If you have
a coarse-grained granularity (big tasks with low
intercommunication), the overhead due to synchronization will
be low. However, maybe you won't benefit all the cores of your
system. If you have a fine-grained granularity (small tasks
with high intercommunication), the overhead due to
synchronization will be high and maybe the throughput of your
algorithm won't be good.
23
There are different mechanisms to get synchronization in a
concurrent system. The most popular mechanisms from a
theoretical point of view are:
Semaphore: A semaphore is a mechanism that can be used
to control the access to one or more units of a resource. It
has a variable that stores the number of resources that can be
used and two atomic operations to manage the value of the
variable. A mutex (short for mutual exclusion) is a special
kind of semaphore that can take only two values (resource is
free and resource is busy), and only the process that sets the
mutex to busy can release it.
Monitor: A monitor is a mechanism to get mutual exclusion
over a shared resource. It has a mutex, a condition variable,
and two operations to wait for the condition and to signal the
condition. Once you signal the condition, only one of the
tasks that are waiting for it continues with its execution.
The last concept related to synchronization you're going to learn
in this chapter is thread safety. A piece of code (or a method or
an object) is thread-safe if all the users of shared data are
protected by synchronization mechanisms, a
nonblocking compare-and-swap (CAS) primitive or data is
immutable, so you can use that code in a concurrent application
without any problem.
An immutable object is an object with a very special
characteristic. You can't modify its visible state (the value of its
attributes) after its initialization. If you want to modify an
immutable object, you have to create a new one.
Its main advantage is that it is thread-safe. You can use it in
concurrent applications without any problem.
24
An example of an immutable object is the String class in Java.
When you assign a new value to a String object, you are
creating a new string
An atomic operation is a kind of operation that appears to
occur instantaneously to the rest of the tasks of the program. In
a concurrent application, you can implement an atomic
operation with a critical section to the whole operation using a
synchronization mechanism.
An atomic variable is a kind of variable with atomic operations
to set and get its value. You can implement an atomic variable
using a synchronization mechanism or in a lock-free manner
using CAS, which doesn't need any synchronization
Tasks can use two different methods to communicate with each
other. The first one is shared memory, and normally it is used
when the tasks are running in the same computer. The tasks use
the same memory area where they write and read values. To
avoid problems, the access to this shared memory has to be in a
critical section protected by a synchronization mechanism.
The other synchronization mechanism is message
passing and normally is used when the tasks are running in
different computers. When a task needs to communicate with
another, it sends a message that follows a predefined protocol.
This communication can be synchronous if the sender is
blocked waiting for a response or asynchronous if the sender
continues with their execution after sending the message.
25
Possible Problems in Concurrent Applications
Data Race
You can have a data race (also named race condition) in your
application when you have two or more tasks writing a shared
variable outside a critical sectionthat's to say, without using
any synchronization mechanisms.
Under these circumstances, the final result of your application
may depend on the order of execution of the tasks. Look at the
following example:
package com.packt.java.concurrency;
public class Account {
private float balance;
public void modify (float difference) {
float value=this.balance;
this.balance=value+difference;
}
}
Imagine that two different tasks execute the modify() method in
the same Account object. Depending on the order of execution
of the sentences in the tasks, the final result can vary. Suppose
that the initial balance is 1000 and the two tasks call
the modify() method with 1000 as a parameter. The final result
should be 3000, but if both tasks execute the first sentence at
the same time and then the second sentence at the same time,
the final result will be 2000. As you can see,
the modify() method is not atomic and the Account class is not
thread-safe.
26
Deadlock
There is a deadlock in your concurrent application when there
are two or more tasks waiting for a shared resource that must be
free from the other, so none of them will get the resources they
need and will be blocked indefinitely. It happens when four
conditions happen simultaneously in the system. They
are Coffman's conditions, which are as follows:
Mutual exclusion: The resources involved in the deadlock
must be nonshareable. Only one task can use the resource at
a time.
Hold and wait condition: A task has the mutual exclusion
for a resource and it's requesting the mutual exclusion for
another resource. While it's waiting, it doesn't release any
resources.
No pre-emption: The resources can only be released by the
tasks that hold them.
Circular wait: There is a circular waiting where Task 1 is
waiting for a resource that is being held by Task 2, which is
waiting for a resource being held by Task 3, and so on until
we have Task n that is waiting for a resource being held by
Task 1.
There exist some mechanisms that you can use to avoid
deadlocks:
Ignore them: This is the most commonly used mechanism.
You suppose that a deadlock will never occur on your
system, and if it occurs, you can see the consequences of
stopping your application and having to re-execute it.
27
Detection: The system has a special task that analyzes the
state of the system to detect if a deadlock has occurred. If it
detects a deadlock, it can take action to remedy the problem.
For example, finishing one task or forcing the liberation of a
resource.
Prevention: If you want to prevent deadlocks in your
system, you have to prevent one or more of Coffman's
conditions.
Avoidance: Deadlocks can be avoided if you have
information about the resources that are used by a task
before it begins its execution. When a task wants to start its
execution, you can analyze the resources that are free in the
system and the resources that the task needs to decide that it
can start its execution or not.
Livelock
A livelock occurs when you have two tasks in your systems that
are always changing their states due to the actions of the other.
Consequently, they are in a loop of state changes and unable to
continue.
For example, you have two tasksTask 1 and Task 2and
both need two resources: Resource 1 and Resource 2. Suppose
that Task 1 has a lock on Resource 1, and Task 2 has a lock on
Resource 2. As they are unable to gain access to the resource
they need, they free their resources and begin the cycle again.
This situation can continue indefinitely, so the tasks will never
end their execution.
28
Resource starvation
Resource starvation occurs when you have a task in your
system that never gets a resource that it needs to continue with
its execution. When there is more than one task waiting for a
resource and the resource is released, the system has to choose
the next task that can use it. If your system has not got a good
algorithm, it can have threads that are waiting for a long time
for the resource.
Fairness is the solution to this problem. All the tasks that are
waiting for a resource must have the resource in a given period
of time. An option is to implement an algorithm that takes into
account the time that a task has been waiting for a resource
when it chooses the next task that will hold a resource.
However, fair implementation of locks requires additional
overhead, which may lower your program throughput.
Priority inversion
Priority inversion occurs when a low-priority task holds a
resource that is needed by a high-priority task, so the low-
priority task finishes its execution before the high-priority task.
Executors
The executor framework is a mechanism that allows you to
separate thread creation and management for the
implementation of concurrent tasks. You don't have to worry
about the creation and management of threads, only about
creating tasks and sending them to the executor. The main
classes involved in this framework are:
The Executor and ExecutorService interface:
They include methods common to all executors.
29
ThreadPoolExecutor: This is a class that allows you to get an
executor with a pool of threads and optionally define a
maximum number of parallel tasks
ScheduledThreadPoolExecutor: This is a special kind of
executor to allow you to execute tasks after a delay or
periodically
Executors: This is a class that facilitates the creation of
executors
The Callable interface: This is an alternative to
the Runnable interfacea separate task that can return a
value
The Future interface: This is an interface that includes the
methods to obtain the value returned by a Callable interface
and to control its status
In order to write custom executors , we need to override the
below methods :
protected void afterExecute(Runnable r, Throwable t)
protected void beforeExecute(Thread t, Runnable r)
Below is a sample snippet using executors
package com.java8.tutorial.threads;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
30
import java.util.concurrent.Future;
publicclass ExecutorTest {
publicstaticvoid main(String args[])
Future<String>future = null;
ExecutorService executorService =
Executors.newFixedThreadPool(20);
for (intcounter = 1; counter<= 20; counter++) {
future = executorService.submit(new
Callable<String>() {
public String call() {
return
(Thread.currentThread().getName() + " Asynchronous Callable
");
});
try {
System.out.println(future.get()
+ counter);
} catch (InterruptedException |
ExecutionException e) {
e.printStackTrace();
31
}
Fork Join Framework
The Fork/Join framework defines a special kind of
executor specialized in the resolution of problems with the
divide and conquer technique. It includes a mechanism to
optimize the execution of the concurrent tasks that solve these
kinds of problems. Fork/Join is specially tailored for fine-
grained parallelism as it has a very low overhead in order to
place the new tasks into the queue and take queued tasks for
execution. The main classes and interfaces involved in this
framework are:
ForkJoinPool: This is a class that implements the executor
that is going to run the tasks
ForkJoinTask: This is a task that can be executed in
the ForkJoinPool class
ForkJoinWorkerThread: This is a thread that is going to
execute tasks in the ForkJoinPool class
Following snippet demonstrates the fork join concept
package com.java8.tutorial.threads;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
32
import java.util.concurrent.RecursiveTask;
public class ForkJoinTest {
public static void main (String args[])
ForkJoinPool pool = ForkJoinPool.commonPool();
int arr_int[]=new int[4001] ;
for (int i=0 ;i<=4000;i++)
arr_int[i]=i;
long final_ans =pool.invoke(new
Task(0,arr_int.length,arr_int));
System.out.println(final_ans);
class Task extends RecursiveTask<Long>
{ int lower_index;
int upper_index;
int[] int_array;
33
static final int UPPER_BOUND = 2000;
Task(int low, int high, int[] arr)
{ lower_index=low;
upper_index=high;
int_array = arr;
@Override
protected Long compute() {
if (upper_index-lower_index <= UPPER_BOUND)
{long sum =0;
for (int i= 0;i<upper_index;i++)
sum+=int_array[i];
System.out.println("Simple
Computation");
return sum;
else
int mid = lower_index + (upper_index -
lower_index)/2;
34
Task left = new Task(lower_index,
mid, int_array);
Task right = new Task(mid,
upper_index, int_array);
left.fork();
long rightans= right.compute();
long leftans= left.join();
System.out.println("Fork Join is executed ");
return leftans+rightans;
}
Producer and Consumers
A classic example of thread communication involving
conditions is the relationship between a producer thread and a
consumer thread. The producer thread produces data items to be
consumed by the consumer thread. Each produced data item is
stored in a shared variable.
Imagine that the threads are running at different speeds. The
producer might produce a new data item and record it in the
shared variable before the consumer retrieves the previous data
item for processing. Also, the consumer might retrieve the
contents of the shared variable before a new data item is
produced.
35
To overcome those problems, the producer thread must wait
until its notified that the previously produced data item has
been consumed, and the consumer thread must wait until its
notified that a new data item has been produced. Below
code shows you how to accomplish this task
via wait() and notify().
package com.java8.tutorial.threads;
publicclass ProducerConsumer {
privatecharc ;
privatevolatilebooleanwritable=true ;
publicsynchronizedvoid setC(charc){
while(!writable){
try {
wait();
} catch (InterruptedException e) {
e.printStackTrace();
this.c=c;
writable = false ;
notify();
36
publicsynchronizedchar getC(){
while (writable){
try {
wait();
} catch (InterruptedException e) {
e.printStackTrace();
writable = true;
notify();
returnc;
publicstaticvoid main(String args [])
ProducerConsumer pc =newProducerConsumer();
Producer p = new Producer (pc);
Consumer c = new Consumer (pc);
p.start();
c.start();
37
}
class Producer extends Thread
ProducerConsumer pc;
Producer(ProducerConsumer pc){
this.pc=pc;
publicvoid run (){
for (charc = 'A' ; c<='Z'; c++){
pc.setC(c);
System.out.println("Producer" + c);
}}
class Consumer extends Thread {
ProducerConsumer pc ;
Consumer (ProducerConsumer pc)
this.pc =pc;
publicvoid run ()
38
{
charc;
do {
c =pc.getC();
System.out.println("Consumer" + c);
while (c !='Z');
Phasers
A phaser is a more flexible cyclic barrier. Like a cyclic barrier,
a phaser lets a group of threads wait on a barrier; these threads
continue after the last thread arrives. A phaser also offers the
equivalent of a barrier action. Unlike a cyclic barrier, which
coordinates a fixed number of threads, a phaser can coordinate a
variable number of threads, which can register at any time. To
implement this capability, a phaser uses phases and phase
numbers.
A phase is the phasers current state, and this state is
identified by an integer-based phase number. When the last of
the registered threads arrives at the phaser barrier, a phaser
advances to the next phase and increments its phase number by
1.
39
The java.util.concurrent.Phaser class implements a phaser.
Because this class is thoroughly described in its Javadoc, Ill
point out only a few constructor and methods:
The Phaser(int threads) constructor creates a phaser that
initially coordinates nthreads threads (which have yet to
arrive at the phaser barrier) and whose phase number is
initially set to 0.
The int register() method adds a new unarrived thread to this
phaser and returns the phase number to classify the arrival.
This number is known as the arrival phase number.
The int arriveAndAwaitAdvance() method records arrival
and waits for the phaser to advance (which happens after the
other threads have arrived). It returns the phase number to
which the arrival applies.
The int arriveAndDeregister() method arrives at this phaser
and deregisters from it without waiting for others to arrive,
reducing the number of threads required to advance in future
phases.
Following snippet shows the phaser code
package com.java8.tutorial.threads;
import java.util.concurrent.Phaser;
publicclass PhasorTest {
Publicstaticvoid main (String args []) throws
InterruptedException
40
Phaser phasor =new Phaser();
phasor.register();
System.out.println(phasor.getPhase());
new
PhasorTest().addPhasorTask(phasor,2000);
new
PhasorTest().addPhasorTask(phasor,4000);
new
PhasorTest().addPhasorTask(phasor,6000);
phasor.arriveAndDeregister();
Thread.sleep(10000);
System.out.println(phasor.getPhase());
void addPhasorTask(Phaser phasor,intsleeptime)
phasor.register();
new Thread (){
publicvoid run ()
System.out.println (Thread.currentThread().getName());
phasor.arriveAndAwaitAdvance();
41
try {
Thread.sleep(sleeptime);
}catch(InterruptedException e)
{
// TODO Auto-generated catch block
e.printStackTrace();
System.out.println("After the Barrier"
+ Thread.currentThread().getName());
}.start();
//main controls the others .others finished the
execution and main controls the barrier
Concurrent Data Structures
Normal data structures of the Java API (ArrayList, Hashtable,
and so on) are not ready to work in a concurrent application
unless you use an external synchronization mechanism. If you
use it, you will be adding a lot of extra computing time to your
application. If you don't use it, it's probable that you will have
race conditions in your application. If you modify them from
several threads and a race condition occurs, you may experience
various exceptions thrown (such
42
as, ConcurrentModificationException and ArrayIndexOutOfBo
undsException), there may be silent data loss or your program
may even stuck in an endless loop.
The Java concurrency API includes a lot of data structures that
can be used in concurrent applications without risk. We can
classify them in two groups:
Blocking data structures: These include methods that block
the calling task when, for example, the data structure is
empty and you want to get a value.
Non-blocking data structures: If the operation can be
made immediately, it won't block the calling tasks.
Otherwise, it returns the null value or throws an exception.
These are some of the data structures:
ConcurrentLinkedDeque: This is a non-blocking list
ConcurrentLinkedQueue: This is a non-blocking queue
LinkedBlockingDeque: This is a blocking list
LinkedBlockingQueue: This is a blocking queue
PriorityBlockingQueue: This is a blocking queue that orders
its elements based on its priority
ConcurrentSkipListMap: This is a non-blocking navigable
map
ConcurrentHashMap: This is a non-blocking hash map
AtomicBoolean, AtomicInteger, AtomicLong,
and AtomicReference: These are atomic implementations of
the basic Java data types
43
A snippet describing blocking queue is below
package com.java8.tutorial.threads;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class BlockingQueue {
public static void main (String args [])
ArrayBlockingQueue<Integer> blocking_queue = new
ArrayBlockingQueue<Integer>(20);
ExecutorService service = Executors.newFixedThreadPool(2);
Runnable producer= () -> {
for (Integer i =1 ; i <= 100 ; i++)
try {
blocking_queue.put(i);
System.out.println (i + " Produced By Producer");
} catch (InterruptedException e) {
e.printStackTrace();
44
}
};
service.execute(producer);
Runnable consumer = () -> {
Integer j = 0;
do {
try {
j=blocking_queue.take();
System.out.println (j + " Consumed By Consumer");
} catch (InterruptedException e) {
e.printStackTrace();
while (j != 100);
service.shutdown();
};
service.execute(consumer);
Completion Service
45
A completion service is an implementation of
the java.util.concurrent.CompletionService<V> interface that
decouples the production of new asynchronous tasks (a
producer) from the consumption of the results of completed
tasks (a consumer). V is the type of a task result.
A producer submits a task for execution (via a worker
thread) by calling one of the submit() methods: one method
accepts a callable argument and the other method accepts a
runnable argument along with a result to return upon task
completion. Each method returns a Future<V> instance that
represents the pending completion of the task. You can then call
a poll() method to poll for the tasks completion or call the
blocking take() method.
A consumer takes a completed task by calling
the take() method. This method blocks until a task has
completed. It then returns a Future<V> object that represents
the completed task. You would
call Future<V>s get() method to obtain this result.
Along with CompletionService<V>, Java 7 introduced
the java.util.concurrent.ExecutorCompletionService<V> class t
o support task execution via a provided executor. This class
ensures that, when submitted tasks are complete, they are
placed on a queue thats accessible to take ().A snippet for
completion service is below.
package com.java8.tutorial.threads;
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;
46
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class CompletionServiceTest {
public static void main (String args []) throws
InterruptedException, ExecutionException
ExecutorService es = Executors.newFixedThreadPool(10);
CompletionService<BigDecimal> cs =
new ExecutorCompletionService<BigDecimal>(es);
cs.submit(new CalculateE(17));
cs.submit(new CalculateE(170));
Future<BigDecimal> result = cs.take();
System.out.println(result.get());
System.out.println();
result = cs.take();
47
System.out.println(result.get());
es.shutdown();
class CalculateE implements Callable<BigDecimal>
final int lastIter;
public CalculateE(int lastIter)
this.lastIter = lastIter;
@Override
public BigDecimal call()
MathContext mc = new MathContext(100,
RoundingMode.HALF_UP);
BigDecimal result = BigDecimal.ZERO;
for (int i = 0; i <= lastIter; i++)
BigDecimal factorial = factorial(new BigDecimal(i));
48
BigDecimal res = BigDecimal.ONE.divide(factorial,
mc);
result = result.add(res);
return result;
private BigDecimal factorial(BigDecimal n)
if (n.equals(BigDecimal.ZERO))
return BigDecimal.ONE;
else
return n.multiply(factorial(n.subtract(BigDecimal.ONE)));
Using streams to collect the Data
A stream is formed by the following three main elements:
A source that generates stream elements
Zero or more intermediate operations that generate output as
another stream
49
One terminal operation that generates a result that could be
either a simple object, array, collection, map, or anything
else
The Stream API provides different terminal operations, but
there are two more significant operations for their flexibility
and power.
The collect() method allows you to transform and group the
elements of the stream generating a new data structure with the
final results of the stream. You can use up to three different data
types: an input data type, the data type of the input elements
that come from the stream, an intermediate data type used to
store the elements while the collect() method is running, and an
output data type returned by the collect() method.The following
method shows an example
publicclass MapCollect {
publicstaticvoid main (String args[])
Path path
=FileSystems.getDefault().getPath("C://Users//malidee//thread_
strem.txt");
try {
List<String>lines =
Files.readAllLines(path,StandardCharsets.UTF_8);
List<String[]>records=
lines.parallelStream().skip(1).map(t ->t.toUpperCase()).map(l -
>l.split(";"))
50
.collect(Collectors.toList());
Iteratoriter = records.iterator();
while (iter.hasNext()){
String[] str=(String[]) iter.next();
for(inti=0;i<str.length;i++)
System.out.print( str[i]);
System.out.println();
} catch (IOException e) {
e.printStackTrace();
51
CHAPTER 3 - NETWORK PROGRAMMING USING
JAVA
Getting Started
Access to networks (the Internet in particular) is becoming an
important and often necessary feature of applications.
Applications frequently need to access and provide services. As
the Internet of Things (IoT) connects more and more devices,
understanding how to access networks becomes crucial.
The important factors that have been the driving forces for more
network applications include the availability of faster networks
with greater bandwidth. This has made it possible to transmit
wider ranges of data, such as video streams. In recent years, we
have seen an increase in connectivity, whether it has been for
new services, more extensive social interactions, or games.
Knowing how to develop network applications is an important
development skill.
Network Addressing
An IP address is represented by the InetAddress class.
Addresses can be either unicast where it identifies a specific
address, or it can be multicast, where a message is sent to more
than one address.
The InetAddress class has no public constructors. To get an
instance, use one of the several static get type methods. For
example, the getByName method takes a string representing the
address as shown next. The string in this case is a Uniform
Resource Locator (URL)(shown below)
InetAddress address =
52
InetAddress.getByName("www.google.com");
System.out.println(address);
System.out.println("CanonicalHostName: "
+ address.getCanonicalHostName());
System.out.println("HostAddress: " +
address.getHostAddress());
System.out.println("HostName:"+address.getHostName());
The NetworkInterface class provides a means of accessing the
devices that act as nodes on a network. This class also provides
a means to get low-level device addresses. Many systems are
connected to multiple networks at the same time. These may be
wired, such as a network card, or wireless, such as for a
wireless LAN or Bluetooth connection.
The NetworkInterface class represents an IP address and
provides information about this IP address. A network
interface is the point of connection between a computer and a
network. This frequently uses an NIC of some type. It does not
have to have a physical manifestation, but it can be performed
in software as done with the loopback connection (127.0.0.1 for
IPv4 and ::1 for IPv6).
The NetworkInterface class does not have any public
constructors. Three static methods are provided to return an
instance of the NetworkInterface class:
getByInetAddress: This is used if the IP address is known
getByName: This is used if the interface name is known
53
getNetworkInterfaces: This provides an enumeration of
available interfaces
The following code illustrates how to use
the getNetworkInterfaces method to obtain and display an
enumeration of the network interfaces for the current computer:
try {
Enumeration<NetworkInterface> interfaceEnum =
NetworkInterface.getNetworkInterfaces();
System.out.printf("Name Display name\n");
for(NetworkInterface element :
Collections.list(interfaceEnum)) {
System.out.printf("%-8s %-32s\n",
element.getName(), element.getDisplayName());
} catch (SocketException ex) {
// Handle exceptions
These terms are used to refer to the name and location of an
Internet resource. A URI identifies the name of a resource, such
as a website, or a file on the Internet. It may contain the name of
a resource and its location.
A URL specifies where a resource is located, and how to
retrieve it. A protocol forms the first part of the URL, and
specifies how data is retrieved. URLs always contain protocol,
54
such as HTTP, or FTP. For example, the following two URLs
use different protocols. The first one uses the HTTPS protocol,
and the second one uses the FTP protocol:
https://www.packtpub.com/
ftp://speedtest.tele2.net/
Java provides classes to support URIs and URLs. The
discussion of these classes begins in the next section. Here, we
will discuss URNs in more depth.
A URN identifies the resource but not its location. A URN is
like a city's name, while a URL is similar to a city's latitude and
longitude. When a resource, such as web page, or file, is moved,
the URL for the resource is no longer correct. The URL will
need to be updated wherever it is used. A URN specifies the
name of a resource but not its location. Some other entity,
when supplied with a URN, will return its location. URNs are
not used that extensively.
The syntax of a URN is shown next. The <NID> element is a
namespace identifier and <NSS> is a namespace-specific string:
<URN> ::= "urn:" <NID> ":" <NSS>
The general syntax of a URI consists of a scheme and a scheme-
specific-part:
[scheme:] scheme-specific-part
There are many schemes that are used with a URI, including:
file: This is used for files systems
FTP: This is File Transfer Protocol
55
HTTP: This is commonly used for websites
mailto: This is used as part of a mail service
urn: This is used to identify a resource by name
The scheme-specific-part varies by the scheme that is used.
URIs can be categorized as absolute or relative, or as opaque or
hierarchical. These distinctions are not of immediate interest to
us here, though Java provides methods to determine whether a
URI falls into one of these categories.
A URI can be created for different schemes using several
constructor variations. The simplest way of creating a URI is to
use a string argument specifying the URI, as shown here:
URI uri = new
URI("https://www.packtpub.com/books/content/support");
The next URI uses a fragment to access a subsection of the
Wikipedia article dealing with the normalization of a URL:
uri = new URI("https://en.wikipedia.org/wiki/"
+ "URL_normalization#Normalization_process");
We can also use the following version of the constructor to
specify the scheme, host, path, and fragment of the URI:
uri = new
URI("https","en.wikipedia.org","/wiki/URL_normalization",
"Normalization_process");
There are several ways of creating a URL instance. The easiest
is to simply provide the URL of the site as the argument of the
56
class' constructor. This is illustrated here where a URL instance
for the Packtpub website is created:
URL url = new URL("http://www.packtpub.com");
A URL requires a protocol to be specified. For example, the
following attempt to create a URL will result in
a java.net.MalformedURLException: no protocol:
www.packtpub.com error message:
url = new URL("www.packtpub.com");
There are several constructor variations. The following two
variations will create the same URL. The second one uses
parameters for the protocol, host, port number, and file:
url = new URL("http://pluto.jhuapl.edu/");
url = new URL("http", "pluto.jhuapl.edu", 80,
"News-Center/index.php");
The InetAddress class represents an IP address. The IP protocol
is a low-level protocol used by the UDP and TCP protocols. An
IP address is either a 32-bit or a 128-bit unsigned number that is
assigned to a device.
IP addresses have a long history and use two major versions:
IPv4 and IPv6. The number 5 was assigned to the Internet
Stream Protocol. This was an experimental protocol, but it was
never actually referred to as version IPv5 and was not intended
for general use.
The InetAddress class' getAllByName method will return the IP
address for a given URL. In the following example,
the addresses associated with www.google.com are displayed:
57
InetAddress names[] =
InetAddress.getAllByName("www.google.com");
for(InetAddress element : names) {
System.out.println(element);
}
NIO Support
The java.io, java.nio, and java.nio subpackages provide most of
the Java support for IO processing. We will examine the
support that these packages provide for network .Here, we will
focus on the basic aspects of the java.nio package.
There are three key concepts used in the NIO package:
Channel: This represents a stream of data between
applications
Buffer: This works with a channel to process data
Selector: This is a technology that allows a single thread to
handle multiple channels
Fig 3.1 Channels and Buffers
A channel and a buffer are typically associated with each other.
Data may be transferred from a channel to a buffer or from a
58
buffer to a channel. The buffer, as its name implies, is a
temporary repository for information. The selector is useful in
supporting application scalability.There are four primary
channels:
FileChannel: This works with a file
DatagramChannel: This supports UDP communications
SocketChannel: This is used with a TCP client
ServerSocketChannel: This is used with a TCP server
A simple way of accessing a server is to use
the URLConnection class. This class represents a connection
between an application and a URL instance. A URL instance
represents a resource on the Internet.
In the next example, a URL instance is created for the Google
website. Using the URL class' openConnection method,
a URLConnection instance is created.
A BufferedReader instance is used to read lines from the
connection that is then displayed:
try {
URL url = new URL("http://www.google.com");
URLConnection urlConnection = url.openConnection();
BufferedReader br = new BufferedReader(
new InputStreamReader(urlConnection.getInputStream()));
String line;while ((line = br.readLine()) != null) {
System.out.println(line);
}
br.close();
} catch (IOException ex) {
// Handle exceptions
}
59
We can rework the previous example to illustrate the use of
channels and buffers. The URLConnection instance is created
as before. We will create a ReadableByteChannel instance and
then a ByteBuffer instance, as illustrated in the next example.
The ReadableByteChannel instance allows us to read from the
site using its read method. A ByteBuffer instance receives data
from the channel and is used as the argument of
the read method. The buffercreated holds 64 bytes at a time.
The read method returns the number of bytes read.
The ByteBuffer class' array method returns an array of bytes,
which is used as the argument of the String class' constructor.
This is used to display the data read. The clearmethod is used to
reset the buffer so that it can be used again:
try {
URL url = new URL("http://www.google.com");
URLConnection urlConnection = url.openConnection();
InputStream inputStream =
urlConnection.getInputStream();
ReadableByteChannel channel =
Channels.newChannel(inputStream);
ByteBuffer buffer = ByteBuffer.allocate(64);
String line = null;
while (channel.read(buffer) > 0) {
System.out.println(new String(buffer.array()));
buffer.clear();
60
}
channel.close();
} catch (IOException ex) {
// Handle exceptions
There are three classes used to support asynchronous channel
operations:
AsynchronousSocketChannel: This is a simple asynchronous
channel to a socket
AsynchronousServerSocketChannel: This is an asynchronous
channel to a server socket
AsynchronousDatagramChannel: This is a channel for a
datagram-oriented socket
The read/write methods of
the AsynchronousSocketChannel class are asynchronous.
The AsynchronousServerSocketChannel class possesses
an accept method, which returns
an AsynchronousSocketChannel instance. This method is also
asynchronous.
There are two ways of handling asynchronous I/O operations:
Using the Future interface found in the java.util.concurrent package
Using a CompletionHandler interface
The Future interface represents a pending result. This supports
asynchronous operations by allowing the application to
61
continue executing and not block. Using this object, you can use
one of the following methods:
The isDone method
The get method, which blocks until completion
The get method is overloaded with one version supporting a
timeout. The CompletionHandler instance is invoked when the
operation has completed. This is essentially a callback. We will
not illustrate this approach here.
We will develop an asynchronous server and client
called AsynchronousServerSocketChannelServer and Asynchro
nousSocketChannelClient, respectively. The client/server
application is limited and only allows messages to be sent from
the client to the server. This will allow us to focus on the
asynchronous aspects of the application.
Client Server Architecture
There are several ways of creating servers using Java. We will
illustrate a couple of simple approaches. A server is installed on
a machine with an IP address. It is possible for more than one
server to be running on a machine at any given time. When the
operating system receives a request for a service on a machine,
it will also receive a port number. The port number will identify
the server to where the request should be forwarded. A server
is, thus, identified by its combination of IP address and port
number.
Typically, a client will issue a request to a server. The server
will receive the request and send back a response. The nature of
the request/response and the protocol used for communication is
dependent on the client/server. Sometimes a well-documented
62
protocol, such as the Hypertext Transfer Protocol (HTTP), is
used. For simpler architectures, a series of text messages are
sent back and forth.
For the server to communicate with an application making a
request, specialized software is used to send and receive
messages. This software is called a socket. One socket is found
on the client side, and the other socket is located on the server
side. When they connect, communication is possible. There are
several different types of sockets. These include datagram
sockets; stream sockets, which frequently use TCP; and raw
sockets, which normally work at the IP level. We will focus on
TCP sockets for our client/server application.
Specifically, we will create a simple echo server. This server
will receive a text message from a client and will immediately
send it back to that client. The simplicity of this server allows
us to focus on the client-server basics. The below code
demonstrate simple echo server and client example.
Fig 3.2 HTTP Network Communication
/*
* To change this license header, choose License Headers in
Project Properties.
63
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package simpleechoserver;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
*
* @author deepakmali
*/
public class SimpleEchoServer {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
try {
ServerSocket serverSocket = new ServerSocket(6000);
64
System.out.println("Waiting for the Connection....");
Socket clientSocket = serverSocket.accept();
System.out.println("Connection Established");
BufferedReader reader= new BufferedReader(new
InputStreamReader(clientSocket.getInputStream()));
System.out.println(reader.readLine());
PrintWriter writer = new
PrintWriter(clientSocket.getOutputStream(),true);
String inputLine;
while((inputLine=reader.readLine())!=null){
System.out.println(inputLine);
writer.write(inputLine);}
} catch (IOException ex) {
Logger.getLogger(SimpleEchoServer.class.getName()).log(Lev
el.SEVERE, null, ex);
// TODO code application logic here
}
}
/*
* To change this license header, choose License Headers in
Project Properties.
65
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package simpleechoclient;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.Scanner;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
* @author deepakmali
66
*/
public class SimpleEchoClient {
/**
* @param args the command line arguments
*/
public static void main(String[] args){
try {
InetAddress localAddress = InetAddress.getLocalHost();
System.out.println(localAddress.getHostAddress());
try {
Socket clientSocket = new Socket(localAddress,6000);
PrintStream writer = new
PrintStream(clientSocket.getOutputStream());
BufferedReader reader = new BufferedReader(new
InputStreamReader(System.in));
// PrintWriter writer =new PrintWriter();
//BufferedReader reader = new BufferedReader(new
InputStreamReader(clientSocket.getInputStream()));
System.out.println("Connected to Server
...."+clientSocket.getRemoteSocketAddress());
//Scanner scanner = new Scanner(System.in);
67
while(true){
// BufferedReader reader = new BufferedReader(new
InputStreamReader(System.in));
String readline =reader.readLine();
System.out.println("Enter Text ....");
/* Supplier<String> socket_input = ()->{
try {
return reader.readLine();
catch(IOException e)
{return null;
}; */
//String inputline =scanner.nextLine();
//System.out.println(inputline);
if ("quit".equalsIgnoreCase(readline))
break;
//String readline =reader.readLine();
writer.println(readline);
System.out.println("Server Response" + readline);
68
}
} catch (IOException ex) {
Logger.getLogger(SimpleEchoClient.class.getName()).log(Lev
el.SEVERE, null, ex);
} catch (UnknownHostException ex) {
Logger.getLogger(SimpleEchoClient.class.getName()).log(Lev
el.SEVERE, null, ex);
The WebServer definition is shown next.
The ServerSocket class is the foundation of the server.
Its accept method will block until a request is made. When this
happens, a new thread based on the ClientHandler class will be
started:
public class WebServer {
public WebServer() {
System.out.println("Webserver Started");
try (ServerSocket serverSocket = new ServerSocket(80)) {
while (true) {
System.out.println("Waiting for client request");
69
Socket remote = serverSocket.accept();
System.out.println("Connection made");
new Thread(new ClientHandler(remote)).start();
} catch (IOException ex) {
ex.printStackTrace();
public static void main(String args[]) {
new WebServer();
Mac users may encounter an error when using port 80. Use
port 3000 or 8080 instead. Threads are concurrently executing
sequences of code within a process. In Java, a thread is created
using the Thread class. The constructor's argument is an object
that implements the Runnable interface. This interface consists
of a single method: run. When the thread is started using
the start method, a separate program stack is created for the new
thread, and the runmethod executes on this stack. When
the run method terminates, the thread terminates.
The ClientHandler class, shown next, implements
the Runnable interface. Its constructor is passed to the socket
70
representing the client. When the thread starts, the run method
executes. The method displays, starting and terminating
messages. The actual work is performed in
the handleRequest method:
public class ClientHandler implements Runnable {
private final Socket socket;
public ClientHandler(Socket socket) {
this.socket = socket;
@Override
public void run() {
System.out.println("\nClientHandler Started for " +
this.socket);
handleRequest(this.socket);
System.out.println("ClientHandler Terminated for "
+ this.socket + "\n");
The handleRequest method uses the input and output streams to
communicate with the server. In addition, it determines what
request was made and then processes that request.
71
In the code that follows, the input and output streams are
created and the first line of the request is read.
The StringTokenizer class is used to token this line. When
the nextToken method is invoked, it returns the first word of the
line, which should correspond to an HTTP method:
public void handleRequest(Socket socket) {
try (BufferedReader in = new BufferedReader(
new InputStreamReader(socket.getInputStream()));) {
String headerLine = in.readLine();
StringTokenizer tokenizer =
new StringTokenizer(headerLine);
String httpMethod = tokenizer.nextToken();
...
} catch (Exception e) {
e.printStackTrace();
A tokenizer is a process that splits text into a series of tokens.
Frequently, these tokens are simple words.
The StringTokenizer class's constructor is passed the text to be
tokenized. The nextToken method will return the next available
token.
72
The next code sequence handles the GET method. A message is
displayed on the server side to indicate that a GET method is
being processed. This server will return a simple HTML page.
The page is built using the StringBuilderclass where
the append methods are used in a fluent style.
The sendResponse method is then invoked to actually send the
response. If some other method was requested, then a 405 status
code is returned:
if (httpMethod.equals("GET")) {
System.out.println("Get method processed");
String httpQueryString = tokenizer.nextToken();
StringBuilder responseBuffer = new StringBuilder();
responseBuffer
.append("<html><h1>WebServer Home Page....
</h1><br>")
.append("<b>Welcome to my web server!</b><BR>")
.append("</html>");
sendResponse(socket, 200, responseBuffer.toString());
} else {
System.out.println("The HTTP method is not recognized");
sendResponse(socket, 405, "Method Not Allowed");
73
If we wanted to handle other methods, then a series of else-if
clauses can be added. To further process the GET method, we
will need to parse the remainder of the initial request line. The
following statement will give us a string that we can process:
String httpQueryString = tokenizer.nextToken();
The previous statement is not needed for this example and
should not be included in the code. It simply offers one possible
way of further processing HTTP queries.
Once we have created a response, we will use
the sendResponse method to send it to the client as shown next.
This method is passed the socket, a status code, and the
response string. An output stream is then created:
public void sendResponse(Socket socket,
int statusCode, String responseString) {
String statusLine;
String serverHeader = "Server: WebServer\r\n";
String contentTypeHeader = "Content-Type:
text/html\r\n";
try (DataOutputStream out =
new DataOutputStream(socket.getOutputStream());) {
...
out.close();
} catch (IOException ex) {
74
// Handle exception
If the status code is 200, then a simple HTML page is returned.
If the status code is 405, then a single status code line is
returned. Otherwise, a 404 response is sent. As we used
the DataOutputStream class to write, we use
its writeBytes method to handle strings:
if (statusCode == 200) {
statusLine = "HTTP/1.0 200 OK" + "\r\n";
String contentLengthHeader = "Content-Length: "
+ responseString.length() + "\r\n";
out.writeBytes(statusLine);
out.writeBytes(serverHeader);
out.writeBytes(contentTypeHeader);
out.writeBytes(contentLengthHeader);
out.writeBytes("\r\n");
out.writeBytes(responseString);
} else if (statusCode == 405) {
statusLine = "HTTP/1.0 405 Method Not Allowed" + "\r\n";
out.writeBytes(statusLine);
75
out.writeBytes("\r\n");
} else {
statusLine = "HTTP/1.0 404 Not Found" + "\r\n";
out.writeBytes(statusLine);
out.writeBytes("\r\n");
We will use the following HTTPClient class to access our
HTTP server. In its constructor, a socket connecting to the
server is created.
The Socket class's getInputStream and getOutputStream return
input and output streams for the socket, respectively.
The sendGet method is called, which sends a request to the
server. The getResponse method returns the response, which is
then displayed:
public class HTTPClient {
public HTTPClient() {
System.out.println("HTTP Client Started");
try {
InetAddress serverInetAddress =
InetAddress.getByName("127.0.0.1");
Socket connection = new Socket(serverInetAddress,
80);
76
try (OutputStream out = connection.getOutputStream();
BufferedReader in =
new BufferedReader(new
InputStreamReader(
connection.getInputStream()))) {
sendGet(out);
System.out.println(getResponse(in));
} catch (IOException ex) {
ex.printStackTrace();
...
public static void main(String[] args) {
new HTTPClient();
The sendGet method follows this, which sends a GET method
request using a simple path. This is followed by a User-
Agent header. We used an instance of the OutputStream class
with the write method. The write method requires an array of
77
bytes. The String class's getBytes method returns this array of
bytes:
private void sendGet(OutputStream out) {
try {
out.write("GET /default\r\n".getBytes());
out.write("User-Agent: Mozilla/5.0\r\n".getBytes());
} catch (IOException ex) {
ex.printStackTrace();
The getResponse method is as follows and is passed
a BufferedReader instance to get the response from the server.
It returns a string created using the StringBuilder class:
private String getResponse(BufferedReader in) {
try {
String inputLine;
StringBuilder response = new StringBuilder();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine).append("\n");
return response.toString();
78
} catch (IOException ex) {
ex.printStackTrace();
return "";
Specifically, we will use
the HttpURLConnection and HTTPServer classes to
implement a client and server application. These classes support
much of the functionality required for clients and servers. Using
these classes will avoid writing low-level code to implement
HTTP functionality. Low-level code refers to the non-
specialized classes, such as the Socket class. Higher-level and
more specialized classes, such as
the HttpURLConnection and HTTPServerclasses, supplement
and provide additional support for specialized operations.
UDP and Multitasking
Multicasting is a useful technique to use if you need to send
messages to a group on a periodic basis. It uses a UDP server
and one or more UDP clients. To illustrate this capability, we
will create a simple time server. The server will send a date and
time string to clients every second.
Multicasting will send an identical message to every member of
a group. A group is identified by a multicast address. A
multicast address must use the following IP address
range: 224.0.0.0 through 239.255.255.255. The server will send
a message mark with this address. Clients must join the group
before they can receive any multicast messages.
79
A MulticastServer class is declared next, where
a DatagramSocket instance is created. The try-catch blocks will
handle exceptions as they occur:
public class MulticastServer {
public static void main(String args[]) {
System.out.println("Multicast Time Server");
DatagramSocket serverSocket = null;
try {
serverSocket = new DatagramSocket();
...
}
} catch (SocketException ex) {
// Handle exception
} catch (IOException ex) {
// Handle exception
}
The body of the try block uses an infinite loop to create an array
of bytes to hold the current date and time. Next,
an InetAddress instance representing the multicast group is
created. Using the array and the group address,
a DatagramPacket is instantiated and used as an argument
80
to the DatagramSocket class' send method. The data and time
sent is then displayed. The server then pauses for one second:
while (true) {
String dateText = new Date().toString();
byte[] buffer = new byte[256];
buffer = dateText.getBytes();
InetAddress group = InetAddress.getByName("224.0.0.0");
DatagramPacket packet;
packet = new DatagramPacket(buffer, buffer.length,
group, 8888);
serverSocket.send(packet);
System.out.println("Time sent: " + dateText);
try {
Thread.sleep(1000);
} catch (InterruptedException ex) {
// Handle exception
This server only broadcasts messages. It never receives
messages from a client.The client is created using the
following MulticastClient class. In order to receive a message,
81
the client must use the same group address and port number.
Before it can receive messages, it must join the group using
the joinGroup method. In this implementation, it receives 5 date
and time messages, displays them, and then terminates.
The trim method removes leading and trailing white space, from
a string. Otherwise, all 256 bytes of the message will be
displayed:
public class MulticastClient {
public static void main(String args[]) {
System.out.println("Multicast Time Client");
try (MulticastSocket socket = new MulticastSocket(8888)) {
InetAddress group =
InetAddress.getByName("224.0.0.0");
socket.joinGroup(group);
System.out.println("Multicast Group Joined");
byte[] buffer = new byte[256];
DatagramPacket packet =
new DatagramPacket(buffer, buffer.length);
for (int i = 0; i < 5; i++) {
socket.receive(packet);
String received = new String(packet.getData());
System.out.println(received.trim());
82
}
socket.leaveGroup(group);
} catch (IOException ex) {
// Handle exception
System.out.println("Multicast Time Client Terminated");
Scalability
When the demand on a server increases and decreases, it is
desirable to change the resources dedicated to the server. The
options available range from the use of manual threads to allow
concurrent behavior to those embedded in specialized classes to
handle thread pools and NIO channels.In this section, we will
use threads to augment our simple echo server. The definition
of the ThreadedEchoServer class is as follows. It implements
the Runnable interface to create a new thread for each
connection. The private Socket variable will hold the client
socket for a specific thread:
public class ThreadedEchoServer implements Runnable {
private static Socket clientSocket;
public ThreadedEchoServer(Socket clientSocket) {
this.clientSocket = clientSocket;
83
}
...
The main method creates the server socket as before, but when
a client socket is created, the client socket is used to create a
thread, as shown here:
public static void main(String[] args) {
System.out.println("Threaded Echo Server");
try (ServerSocket serverSocket = new ServerSocket(6000)) {
while (true) {
System.out.println("Waiting for connection.....");
clientSocket = serverSocket.accept();
ThreadedEchoServer tes =
new ThreadedEchoServer(clientSocket);
new Thread(tes).start();
} catch (IOException ex) {
// Handle exceptions
System.out.println("Threaded Echo Server Terminating");
}
84
The actual work is performed in the run method as shown next.
It is essentially the same implementation as the original echo
server, except that the current thread is displayed to clarify
which threads are being used:
@Override
public void run() {
System.out.println("Connected to client using ["
+ Thread.currentThread() + "]");
try (BufferedReader br = new BufferedReader(
new InputStreamReader(
clientSocket.getInputStream()));
PrintWriter out = new PrintWriter(
clientSocket.getOutputStream(), true)) {
String inputLine;
while ((inputLine = br.readLine()) != null) {
System.out.println("Client request ["
+ Thread.currentThread() + "]: " + inputLine);
out.println(inputLine);
}
System.out.println("Client [" + Thread.currentThread()
+ " connection terminated");
85
} catch (IOException ex) {
// Handle exceptions
}
}
The chief advantage of a multithreaded server is that long-
running client requests will not block the server from accepting
other client requests. If a new thread is not created, then the
current request will be processed. It is only after the request has
been processed that new requests can be accepted. Using a
separate thread for a request means that connections and their
associated requests can be processed concurrently.
When using a multithreaded server, there are several of ways of
configuring the threads as follows:
Thread-per-request
Thread-per-connection
Thread-per-object
Network Security
Security is a complex topic. In this section, we will demonstrate
a few simple aspects of this topic, as it relates to networks.
Specifically, we will create a secure echo server. Creating a
secure echo server is not that much different from the non-
secure echo server that we developed earlier. However, there is
a lot going on behind the scenes to make it work.
There are many security related terms whose meaning and
purpose can be daunting when they are first encountered. Most
of these terms are applicable to network applications. We will
86
start with a brief overview of many of these terms. In later
sections of this chapter, we will go into more details about the
ones that are relevant to our discussion.
Central to most security related issues is encryption. This is the
process of converting information that needs to be protected to
an encrypted form using a key or a set of keys. The receiver of
the encrypted information can use a key or set of keys to
decrypt the information and revert it to its original form. This
technique will prevent unauthorized access to the information.
We will demonstrate the use of both symmetric and
asymmetric encryption techniques. Symmetric encryption uses
a single key to encrypt and decrypt messages. Asymmetric
encryption uses a pair of keys. These keys are frequently stored
in a file called a keystore, which we will demonstrate.
Symmetric encryption is usually faster but requires the sender
and receiver of the encrypted data to share their keys in a safe
and secure manner. For parties that are remotely dispersed, this
can be a problem. Asymmetric encryption is slower, but it uses
a public and private key pair that, as we will see, simplifies the
sharing of keys. Asymmetric encryption is an enabling
technology for digital certificates that provides a means of
verifying the authenticity of documents.
Secure commerce is common and is essential for online
transactions that take place globally every day. The Transport
Layer Security (TLS) and Secure Sockets Layer (SSL) are
protocols that allow secure and reliable communication across
the Internet. It is the basis for Hyper Text Transfer Protocol
Secure (HTTPS) that is used to conduct most transactions on
the Internet. This protocol supports the following:
87
Server and client authentication
Data encryption
Data integrity
Secure hashing is a technique that is used to create certificates.
A certificate is used to verify the authenticity of data, and it
uses a hash value and Java provides support for this process.
We will be using the SSLServerSocketFactory class to
instantiate secure server sockets. In addition, it is necessary to
create keys that the underlying SSL mechanism can use to
encrypt the communications.
An SSLServerSocket class is declared in the following example
to serve as the echo server. As it is similar to the previous echo
server, we will not explain its implementation, except for its
relation to the use of the SSLServerSocketFactory class. Its
static getDefault method returns an instance
of ServerSocketFactory. Its createServerSocket method returns
an instance of a ServerSocket bound to port 8000 that is capable
of supporting secure communications. Otherwise, it is
organized and functions similarly to the previous echo server:
public class SSLServerSocket {
public static void main(String[] args) {
try {
SSLServerSocketFactory ssf = (SSLServerSocketFactory)
SSLServerSocketFactory.getDefault();
ServerSocket serverSocket =
88
ssf.createServerSocket(8000);
System.out.println("SSLServerSocket Started");
try (Socket socket = serverSocket.accept();
PrintWriter out = new PrintWriter(
socket.getOutputStream(), true);
BufferedReader br = new BufferedReader(
new InputStreamReader(
socket.getInputStream()))) {
System.out.println("Client socket created");
String line = null;
while (((line = br.readLine()) != null)) {
System.out.println(line);
out.println(line);
br.close();
System.out.println("SSLServerSocket Terminated");
} catch (IOException ex) {
// Handle exceptions
} catch (IOException ex) {
89
// Handle exceptions
The secure echo client is also similar to the previous non-secure
echo client. The SSLSocketFactory class' getDefault returns
an SSLSocketFactory instance whose createSocket creates a
socket that is connected to the secure echo server. The
application is as follows:
public class SSLClientSocket {
public static void main(String[] args) throws Exception {
System.out.println("SSLClientSocket Started");
SSLSocketFactory sf =
(SSLSocketFactory) SSLSocketFactory.getDefault();
try (Socket socket = sf.createSocket("localhost", 8000);
PrintWriter out = new PrintWriter(
socket.getOutputStream(), true);
BufferedReader br = new BufferedReader(
new InputStreamReader(
socket.getInputStream()))) {
Scanner scanner = new Scanner(System.in);
90
while (true) {
System.out.print("Enter text: ");
String inputLine = scanner.nextLine();
if ("quit".equalsIgnoreCase(inputLine)) {
break;
out.println(inputLine);
System.out.println("Server response: " +
br.readLine());
System.out.println("SSLServerSocket Terminated");
}
}
}
If we executed this server followed by the client, they will abort
with a connection error. This is because we have not provided a
set of keys that the applications can share and use to protect the
data passed between them.
To provide the necessary keys, we need to create a keystore to
hold the keys. When the applications execute, the keystore must
be available to the applications. First, we will demonstrate how
to create a keystore, and then we will show you which runtime
parameters must be supplied.
91
Within the Java SE SDK's bin directory is a program
titled keytool. This is a command-level program that will
generate the necessary keys and store them in a key file. In
Windows, you will need to bring up a command window and
navigate to the root directory of your source files. This directory
will contain the directory holding your application's package.
ou will also need to set the path to the bin directory using a
command that is similar to the following one. This command is
needed to find and execute the keytool application:
set path= C:\Program Files\Java\jdk1.8.0_25\bin;%path%
Next, enter the keytool command. You will be prompted for a
password and other information that is used to create the keys.
This process is shown here, where a password of 123456 is
used although it is not displayed as it is entered:
Enter keystore password:
Re-enter new password:
What is your first and last name?
[Unknown]: First Last
What is the name of your organizational unit?
[Unknown]: packt
What is the name of your organization?
[Unknown]: publishing
What is the name of your City or Locality?
[Unknown]: home
92
What is the name of your State or Province?
[Unknown]: calm
What is the two-letter country code for this unit?
[Unknown]: me
Is CN=First Last, OU=packt, O=publishing, L=home,
ST=calm, C=me correct?
[no]: y
Enter key password for <mykey>
(RETURN if same as keystore password):
With the keystore created, you can run the server and client
applications. How these applications are started depends on
how your projects have been created. You may be able to
execute it from an IDE, or you may need to start them from a
command window.
Next are the commands that can be used from a command
window. The two arguments to the java command are the
location of the keystore and a password. They need to be
executed from the root directory of your package's directory:
java -Djavax.net.ssl.keyStore=keystore.jks -
Djavax.net.ssl.keyStorePassword=123456
packt.SSLServerSocket
java -Djavax.net.ssl.trustStore=keystore.jks -
Djavax.net.ssl.trustStorePassword=123456
packt.SSLClientSocket
93
If you want to use an IDE, then use the equivalent settings for
your runtime command arguments. The following one
illustrates one possible interchange between the client and the
server. The output of the server window is shown first, followed
by that of the client:
SSLServerSocket Started
Client socket created
Hello echo server
Safe and secure
SSLServerSocket Terminated
SSLClientSocket Started
Enter text: Hello echo server
Server response: Hello echo server
Enter text: Safe and secure
Server response: Safe and secure
Enter text: quit
SSLServerSocket Terminated
94
CHAPTER 4- MICROSERVICES USING SPRING
CLOUD
Getting Started
Microservices are an architectural style where a single
application is built as a suite of small services each running as
independent processes and intercommunicating via open
protocols .It deals with building independent applications as a
suite of smaller services ,each of which run as an independent
process. Microservices communicate with each other using
open protocols like HTTP .These services are separately
deployed and maintained and encapsulate business functionality
.Services are independently replaceable .They are not same as
SOA its about decomposing single application .Organizations
are moving fast onto microservices like Netflix has done a great
deal of work to eliminate the Java Monolithic architecture .
Monolithic versus Microservices
Lets take an example of a traditional monolithic shopping cart
application to understand microservices better.
Fig 4.1 Monolithic Architecture
95
The issue here is earlier we used to support only web interfaces
as the client interfaces but now we have proliferation of client
technologies .The figure below depicts the scene .The
controllers which were designed for the web interfaces were not
designed for these other channels. There are also a lot of
superior and numerous backend technologies .Monolithic code
base is also tough to be managed by larger and bigger teams
working in parallel. Microservices allow the team to work on
separate business functionalities (cross functional teams ) .We
are also stuck in monolithic applications in case we want to use
better technologies for some pieces of the functionalities .
Fig. 4.2 Varied UI Interfaces in modern day applications
The advantages and disadvantages of monolithic are stated
under :
Advantages :
96
Drawbacks:
The same Shopping cart architecture using microservices could
be portrayed as below (using polyglot persistence) :
Fig. 4.3 Scenario Depicting MicroServices
Microservices should have very clear interfaces as part of the
API gateway .Netflix cloud native architecture is a very good
example of Microservices .Microservices should not be
orchestrated but choreographed. (decentralized governance
).We can break a monolithic into a microservices using a single
responsibility principal of bounded context besides splitting into
Noun / verb based functionality Microservices are smart but
97
their communication should be dump. Thus the advantages of
microserivces are as follows:
Advantages:
Disadvantages :
Spring Boot
It is a technology to started getting started with Spring quickly
.(http://docs.spring.io/spring-boot/docs/current-
SNAPSHOT/reference/htmlsingle/) Spring Boot provides
intelligent defaults for configuration .It is not about code
98
generation .It gives Easier dependency management .Its about
automatic configuration and provides different build options
.We can go to start.spring.io to download the Spring Initializer
and can be used in any IDE.The following demonstrate the set
up screen .
The spring boot starter will resolve all the transitive
dependencies
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
Other important plugins it will bring along are as follows :
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
- <plugin>
<groupId>org.springframework.boot</groupId>
99
<artifactId>spring-boot-maven-plugin</artifactId>
- <dependencies>
- <dependency>
<groupId>org.springframework</groupId>
<artifactId>springloaded</artifactId>
<version>1.2.0.RELEASE</version>
</dependency>
</dependencies>
</plugin>
MicroServicesBootApplication Java class is the default java
class .We can run the class as described in the screenshot below
.The following operations are accomplished as part of the
application execution
100
We can add the dependency Spring-boot-starter-web in pom to
turn it to a web application and run it as a Java application .The
test controller we may write looks like below :
package demo.controller;
import org.springframework.stereotype.Controller;
import
org.springframework.web.bind.annotation.RequestMapping;
@Controller
public class HelloController {
@RequestMapping("/hi")
public @ResponseBody String hi() {
return "hello";
101
It will respond back to localhost:8080/hi with hello .The
following actions are taken here : (Spring Boot will launch
Tomcat on its own )
We can turn this into a war file using the
SpringBootServletInitializer in the application class and the
setting in pom to war .
Spring Boot also supports JSP and templates like Freemarker ,
Thymeleaf .For Tymeleaf , we will add the following
dependency .Please keep the template under the templates
folder .
102
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-thymeleaf</artifactId>
The controller can be changed as follows :
@Controller
public class HelloController {
@RequestMapping("/hi/{name}")
public String hi(Map model, @PathVariable String
name) {
model.put("name",name);
return "hello";
The template can be changed as follows :
For using Rest capabilities we make use of domain values as
return values /parameters .We can mark with annotations
@RequestBody and @ResonseBody .Spring automatically
handles xml and JSON conversions .We create domain objects
Team and Player as follows :
package demo.domain;
103
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
@Entity
public class Player {
@Id
@GeneratedValue
Long id;
String name;
String position;
public Player() {
super();
public Player(String name, String position) {
this();
this.position = position;
this.name = name;
public Long getId() {
return id;
104
}
public void setId(Long id) {
this.id = id;
public String getName() {
return name;
public void setName(String name) {
this.name = name;
public String getPosition() {
return position;
public void setPosition(String position) {
this.position = position;
package demo.domain;
import java.util.Set;
import javax.persistence.CascadeType;
105
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.OneToMany;
@Entity
public class Team {
@Id
@GeneratedValue
Long id;
String name;
String location;
String mascotte;
@OneToMany(cascade=CascadeType.ALL)
@JoinColumn(name="teamId")
Set<Player> players;
public Team() {
super();
106
public Team(String location, String name, Set<Player>
players) {
this();
this.name = name;
this.location = location;
this.players = players;
public Long getId() {
return id;
public void setId(Long id) {
this.id = id;
public String getName() {
return name;
public void setName(String name) {
this.name = name;
public String getLocation() {
107
return location;
public void setLocation(String location) {
this.location = location;
public String getMascotte() {
return mascotte;
public void setMascotte(String mascotte) {
this.mascotte = mascotte;
public Set<Player> getPlayers() {
return players;
public void setPlayers(Set<Player> players) {
this.players = players;
We will change the controller as follows :
package demo.controller;
108
import org.springframework.stereotype.Controller;
import
org.springframework.web.bind.annotation.RequestMapping;
@RRestController
public class HelloController {
private Team team;
public void init(){
Set<Player> players = new HashSet();
players.add (new Player ("Charlie Brown","pitcher" ));
players .add (new Player ("Snoopy","shortstop"));
team = new Team ("California","Peanuts",players);
@RequestMapping("/hi")
public Team hi() {
return team;
The browser will return us a json response as follows :
109
In case we need the response in XML , annotate the domain
class with JAXB .Next we will add some JPA capability by
adding the spring-boot-starter-data-jpa (<artifactId>spring-boot-
starter-data-jpa</artifactId>) which adds Spring Transaction
management, adds Spring ORM, adds Hibernate /Entity
manager and Spring Data JPA .It will not bring Data base driver
.For the DAO layer, we can provide the interface and Spring
Data will implement it for us .
We should add the JPA annotations to the domain objects
.(already included in the above code for domain classes )
We need to add the following dependency in the pom for hsql
DB
<groupId>org.hsqldb</groupId>
<artifactId>hsql-db</artifactId>
110
We need to create the repository objects for Team and Player as
follows : Crud Repository is from Spring data repository
package .
package demo.repository;
import org.springframework.data.repository.CrudRepository;
import
org.springframework.data.rest.core.annotation.RestResource;
import demo.domain.Team;
@RestResource(path="teams", rel="team")
public interface TeamRepository extends
CrudRepository<Team,Long>{
Spring data JPA is intelligent enough to implement the
following method defined by us in the CRUD Repository
Objects .
In the application class ,we can declare the Crud repository for
Team and Player and we can Autowire them as follows .
111
The controller code can be written to fetch the Team based on
the id or name .
package demo.controller;
import
org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PathVariable;
import
org.springframework.web.bind.annotation.RequestMapping;
import
org.springframework.web.bind.annotation.RestController;
import demo.domain.Team;
import demo.repository.TeamRepository;
@RestController
public class TeamController {
@Autowired TeamRepository teamRepository;
@RequestMapping("/teams")
public Iterable<Team> getTeams() {
112
return teamRepository.findAll();
@RequestMapping("/teams/{id}")
public Team getTeam(@PathVariable Long id){
return teamRepository.findOne(id);
The response will be return the json response as follows :
Last concept is of Spring Data REST which plugs into dynamic
repositories ,generates RESTful interface and thus we need the
code to override the defaults only .The flow is depicted below .
113
Fig 4.4 Architecture following MicroServices
Here we will add the dependency as :
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-rest</artifactId>
We add a RestResource to the DAO as follows:
The json response is going to return us the entire team as
follows
114
Thus when the applications starts the RestResource annotations
are interpreted and Spring Data creates the Controllers and
Request Mappings .In case we provide the crud repository for
Players , the output will return us the links for payers as follows
:
package demo.repository;
import org.springframework.data.repository.CrudRepository;
import
org.springframework.data.rest.core.annotation.RestResource;
import demo.domain.Player;
@RestResource(path="players", rel="player")
public interface PlayerRepository extends
CrudRepository<Player,Long>{
115
Spring Cloud
There are a lot of sub-projects under the Spring Umbrella which
emerged since 2006 as shown below. Spring cloud is a sub-
project under the Spring IO umbrella .
Fig 4.4 Spring Cloud Umbrella
Spring Cloud provides libraries to apply common patterns
needed in distributed applications such as centralized
configuration management ,service Registration ,Load
Balancing,Service to-service calls Circuit breakers ,Routing,
116
Etc,Netflix became the trailblazers in cloud computing and
choose to publish many general use technologies as open source
projects .Spring team realized that Netflix has addressed quite a
few key architectural items in the cloud environment .Spring
team realized the work already done by the Netflix team easily
consumable .Spring cloud project are based on Spring Boot .We
need to use the spring-cloud-starter-parent and spring-cloud
starter- dependencies .
Spring Cloud Config
This is the centralized version management for
distributed/cloud based applications .Applications have
connections to resources , message servers ,other applications
,etc.Usually they use external configuration to adjust the
software behavior and its not hard coded .The different
configuration options are depicted as follows:
With microservices , we have large number of dependent
services ,hence greater need for configuration .Manual
configuration will be brittle. Environment variables change
117
needs deployment. Also, we need to have traeability with
respect to version control. The desired solution for
configuration management can be mentioned as below
The advantages of the config are depicted below:
Spring Cloud config is a designated centralized server to serve
up configuration information. It is backed by some kind of
source control .Client connect over HTTP and retrieve their
configuration settlings .Spring Cloud server config uses an
environment repository .The configuration file naming
convention is
<spring-application-name>-<profile>.yml. Spring application
name is set up by the bootstrap.yml.Profile can be set to work in
different mode .
118
Fig 4.5 Spring Config
Spring cloud config server code is available at Github or we can
build our own following the below steps .
1. Include following dependencies in pom
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-parent</artifactId>
<version>Brixton.SR5</version>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-config-server</artifactId>
2. Modify the application.yml file with the location of
configuration repository.
spring:
cloud:
config:
server:
git:
uri: https://github.com/malidee/Microservices-With-
Spring-Student-Files
searchPaths: ConfigData
# "native" is used when the native profile is active, for
local tests with a classpath repo:
119
native:
searchLocations: classpath:offline-repository/
server:
port: 8001
3. In the application class just add the @EnableConfigServer
annotation .
On the client Side, following steps should be followed :
1. User Spring cloud starter Parent as a Parent pom.
2. Include the Spring cloud starter for config.
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-config</artifactId>
3. Configure Application name and server location in
bootstrap.yml
---
spring:
profiles:
active: northamerica
application:
name: lab-3-client
cloud:
config:
uri: http://localhost:8001
server:
port: 8002
120
Service Discovery with Spring Cloud Eureka
Service discovery means that the client has discovered the other
clients though registering .When we use microservices ,there is
a huge number of inter service calls.Service discovery provides
a single lookup service .Possible solutions in the space are
Eureka ,Zookeeper ,etc .Eureka comes from Netflix .Eureka
provides a lookup server .Client services register with
Eureka.by providing meta data on host ,port,health
indicatorURL ,etc. and sends heatbeats to Eureka .By enabling
the @EnableEurekaServer on the spring boot application
configuration class , we can make a eureka server (as follows)
.Generally multiple eureka servers run simultaneously and they
share a state with each other .
The common server configuration are shown below :
For the eureka client ,add the following depednecy
121
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-eureka-server</artifactId>
The application configuration class will need an annotation
@EnableDiscoveryClient
Finally, set the client service url in the application .properties
file.
eureka:
client:
serviceUrl:
defaultZone:
http://localhost:8010/eureka/,http://localhost:8011/eureka/,http:/
/localhost:8012/eureka/,http://localhost:8013/eureka/
Eureka server is designed for multi instance use. Many run in
different availability zones. It does not persist service
registrations and relies on client registrations .
Spring Cloud Ribbon
Load balancer is a component that sits in front of an application
(web server) and its purpose is to distribute the load or the
incoming traffic among several servers.A client side load
balancer is part of the client application software and its
purpose is to select which server to call .
122
Fig 4.6 Spring Cloud Ribbon
This kind of resilience and adaptability is often needed in a
microservice .The client side load balancer might decide based
on factors such as fast response from a server in a particular
zone .But in case the server in that region is down the client side
load balancer can route the call to other regions .Ribbon is
another project which is part of the Netflix OSS family .It
automatically integrates with Eureka.It has buid in failure
resiliency (Hystrix).Spring cloud provides an easy API wrapper
for using Ribbon.It has provision for caching and batching and
it support multiple protocols .Ribbon discovers what the list of
possible servers are .Server lists can be static or dynamic .An
example of static list is shown below :
Spring cloud refers to servers in the same zone as the Filtered
list of servers .The Ribbon client pings the server in order to
check the health of the server .This is performed by Eureka in
Spring Cloud .In Spring Cloud default load balancer is called as
ZoneAwareLoadBalancer.Rule is the single module of
intelligence that makes the decisions on whether to call or not
.The steps for using a Spring Cloud are as follows :
123
1. Use the Spring Cloud Starrter parent as a Parent pom.
2. Include the ribbon dependency
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-ribbon</artifactId>
3. We can then use the Ribbon code in the following manner
Declarative REST Clients using Feign
Feign is a declarative REST Cleint .It is part of the Netflix
OSS.It allows to write calls to REST services with no
implementation code. Spring cloud provides easy wrapper for
using Feign .Feing is an alternative to technology like Spring
REST template .We need to define interfaces for the Java Rest
client code.Within them, define the method signatures that we
want to make .Then annotate interfaces with Feign annotation
and annotate the individual methods with the Spring MVC
.During Runtime Feign will automatically implements the code
to call REST services and processes the response. Below shows
an implementation example.
124
In the Spring configuration classes ,we need to put an
annotation @EnableFeignClients so that during the
application start Feing libraries can provide a runtime
implementation of whatever we have configured.The below
diagram illustrates this .Feign integrates very well with Ribbon
and Eureka.
Fig 4.7 @EnableFeignClients usage at Startup
Spring Cloud Hystrix Circuit Breaker
Whenever we have large number of interacting services ,there is
always a possibility that one of them could be in a failed state
.This will effect other clients and the result of this will be a
ripple effect leading to a visible problem called as cascade
failures .(as shown below)
125
Fig 4.8 Spring Cloud Hystrix
Microservices are very vulnerable to these type of cascade
failures .Thus we have to take out and isolate the failures of
individual services to avoid cascade failures .This leads us to
the cirduite breaker pattern to avoid cascade failures .Hystrix is
a software circuit breaker which is part of the Netflix OSS.Its a
light weight easy to use wrapper provided by Spring Cloud .It
detects the failures and opens to disallow further calls .It
identifies fallback what to do in case of a service
dependency failure .It automatically closes itself after the
interval .It is said to be closed when operating normally and
open when failure is detected .In order to use Hystrix , we need
to add a dependency and enable Hystrix within a configuration
class .(as follows)
Below is an example of the Hystrix use :
126
We can fine tune the failure detection and recovery mode as
follows:
Hystrix provides a dashboard to check the status of the circuit
breakers (as shown below ).In order to enable the dashboard we
need to add the annotation @EnableHystrixDashboard in the
configuration class and add the dependencies spring-cloud-
started-hystrix-dashboard and spring-boot-starter-actuator
127
CHAPTER 5 - BIG DATA TECHNOLOGY (HADOOP)
Getting Started
Lets say we have a huge back with an insurance product who
have data in various forms like customer data, website activities
to understand the interest of the customers and their transaction
patterns and Competitors pricing and social media (Social
media like Facebook , Twitter)updates generated by market
research firms . It acts as a decision support system and bank
will generate a optimal price considering all these factors. More
the data better will be predictions .The future of bid data
analytics is that of a digital nervous system where in any input
changes will produce a corresponding result.
Reacting swiftly, adapting to changes, learning from experience
,what data is important , what is not are some of the important
parameters.
Currently data from multiple sources is fed into the dataware
house (Database ) and on top of that statistical algorithms are
run to generate analytics helpful in taking business decisions
.Since the data is sampled we dont get the clear picture of the
entire data. Also the there is a delay in decision due to cleaning
of the data .The size of the data is large and complex to get the
better picture while taking the decisions .The performance of
the traditional systems will decrease and hence there is a need
of Big Data Technologies.
The three important attributes being velocity, variety and
volume
Larger will be the data, better will be the result of the data
analytics algorithm. A simple algorithm on a large data set will
yield better result than a sophisticated algorithm on smaller data
128
set. More will be the attributes better will be the result .The data
should be write once and read many times .As pert the
international data conference 6th annual study the data will
increase by 300 times from 2005 to 2020 from 130 exabytes to
4000 exabytes and 33% of the data will be usesul as compared
to 25% used today Distributed computing is the core concept to
Hadoop.
<Difference between map reduce and rdbms>
As the data grew , growing the computing resources was
expensive due to hardware cost , software license and risk of
failure .And the data was ever increasing .In this case Hadoop
concept of distributed programming comes to save us .The task
was now distributed to many processing units or nodes
Advantages of Hadoop
Cost of commodity hardware is pretty low .The licenced
software is free of cost and in case one node goes down , other
node can process the data .Thus the performance will degrade
but it will not halt .With haddop cluster the processing power
increases 10 times with 1/10 the processing cost and 1/10 the
time .Hadoop employs parallelism thus the seek time and the
transfer time paradigm is resolved .Hadoop maintains replica
thus failure of one node does not effect the data integrity as
such .A single server architecture is expensive and inefficient
when compared with the distributed architecture .There are
number of processors communicating via MPI (message
passing interface ) and shared memory .
As the data size increases bandwidth becomes a problem
however Hadoops architecture is a little different .Haddop has
nodes which are like personal computers .There is a hard disk
with CPU on each node .The master node has data locality and
129
hence the bandwidth is used only for small update messages.
There is concept of job tracker and task tracker .Scalability is
high in Hadoop.
Fig 5.1 HDFS Architecture
Hadoop Releases
Every release is in the form of x.y.z , where ,
X = major release
Y = minor release
Z = point release
A change in the major release breaks backward compatibility
but a change in point and minor release does not.
Apache Hadoop is best known for map reduce and its
distributed file system HDFS but it has a family of projects
collectively referred to as Hadoop ecosystem.The other projects
are Pig, Hive, Scoop, Fume, Zoo Keeper, HBASE.
130
Fig 5.2 Features in different releases
HDFS
HDFS is a distributed file system .A cluster is multiple racks
put together and a single rack is multiple computers put together
which are individually named as nodes .The nodes which store
data are data nodes These are slave nodes .Name nodes are
responsible for the management of the files distributed across
the cluster. A file is broken down into smaller chunks called as
blocks .These blocks are then replicated by a factor of three
.(default) These blocks are then distributed across the cluster
and name node keeps track of complete file system and block
locations .The distribution done is smartly done to provide
resilience such that if a block fails or the complete rack fails ,
name node will be able to put the file together. HDFS handles
very large files. Data access for write once and read multiple
times is the best option .HDFS works with commodity
hardware .HDFS is not designed for quick read of data as an
OLTP database. HDFS doesnt work well with lot of small
files.
Block size is the minimum size of data that can be read or
written to a file system. Its default block size id 64 MB. The
131
idea behind large block size is to keep the seek time 1% of the
transfer rate.
Fig 5.3 Name, Nodes and Data Nodes
HDFS Architecture
HDFS follows a master slave architecture where data node
stores the data and name node manages the data space on the
worker nodes .Name node keeps the track of the complete file
system by managing 2 things -the name space image and the
edit logs .Namespace is the metadata about files and all the
directories .It contains all the metadata of the blocks .Edit log is
the log of activities performed by the client .Edit logs keep on
piling up and grow as the activity on HDFS keeps on
happening. These two combined gives the complete file system
image .As soon as the data node connects to the network and
boots up it will send the information to the name node and thus
the namespace image is updated .The file system image is
maintained in the main memory of the name node .In case the
name node fails the complete file system will go down since the
namespace image will be lost .Thats why name node is termed
as Single Point of Failure and hence it is advisable to spend
more on the name node hardware .To avoid complete failures ,
132
name node is backed up by Remote NFS mount .There is also a
secondary name node where the name node can transfer the
edit logs in case the space is a constraint through check points
.Name node and secondary name nodes are Java programs .In
case of name node fails, the administrator has to reboot the
name node .In case of failure the machine running the
secondary name node is often the best candidate for name node
.The secondary name node in this case takes the information
from NFS mount before starting to act as Name node .As a rule
of thumb, 1000 MB per million storage blocks is required for
the name node main memory .
Fig 5.4 HDFS Architecture and File System Image
The name node maintains the namespace image which is the
metadata like the block size, files, directories, permissions. Edit
log maintain all the activities done on the data anodes .Its keeps
on growing at a faster pace .These two combined give us the
complete file and block information .As soon as the data node
connects to the network the namespace image is modified with
the blocks the data node contains.This information resides in the
main memory of the name node .In case name node goes down
,its known as SPOF and name node should be resilient to
133
hardware failures and we should always spent more on the
name node .Name node is backed up with a remote NFS mount
and there is also a secondary name node feature .It donesnt
function like a name node ..Its only purpose is to merge the edit
logs and write it to a file system. And create check points on the
combined name space and edit log .Name node and secondary
name nodes are java programs .In case of failure of the name
node Hadoop admin has to boot up a new name node .in case of
failure the secondary name node machine is often the best
choice to be booted as the name node .In the later releases,
name node has been made more resilient .Name node should
have enough main memory to manage the pool of data blocks in
the cluster .1000 mb per million storage blocks is the thumb
rule.
HDFS client is a JVM has to run on the node that runs hdfc
.the client will first communicates with name node and name
node makes certain checks like if the file exists or not or the
clinet has correct permission levels or not .(dfs.replicationfacot
is set to 3 )
The name node returns the data nodes to be copied on .The clist
connects to the first dat anode and ask for the subsequent
pipeleine to copy on other data nodes. Data node acknowledges
the copy .and this goes on till the files files is copied on the
HDFS
Name node would send a completion message .in case a data
node fails the data is replicated on the other data nodes .Name
node notices under replication and arranges for replication.
In order to wirte the dat on the data node the first node is either
the node on which the client resides or the a given rack.. such
that the node is not overly loaded .the second node is chosed off
134
the rack and the third node is chosen on the same rask on which
the second node was cohhsen which forms the pipeline
.selection and replication happens behind the scenes . node
distance is related to the bandwidth .
HDFS Read /Write
HDFS client is a JVM which has to run on the node which
interacts with HDFS .The hdfs.replication property contains the
replication factor .In pseudo distribution mode it is overwritten
and set to 1 in the configuration file hdfs-site.xml .The client
communicates with the name node that it wants to write on the
name node .The name node performs various checks like
whether the file exists or not or the client has correct
permissions or not. Name node returns the list of nodes to be
copied on .Client connects to the first data node and asks it for
subsequent pipelines to data nodes .The data nodes will
acknowledge as they successfully write the block .Finally the
client sends a completion message .Name node would notice
under replication and arranges for the replication of the nodes
.If the client node is the part of the cluster name node would
consider it to be the first node where the replication should
happen else any node which is not busy or loaded is chosen .
The second one is chosen off the rack with the first one and the
third is chosen on the same rack as the second one. This forms
the pipeline .The whole process of selection and replication
happens behind the curtains.
135
Fig 5.5 HDFS Write
The distances are calculated in HDFS as follows. The idea of
distance is related to bandwidth.
Block on the same rack and node distance = 0
Block on the same rack but different node distance = 2
Block on the different rack but the same data centre distance
=4
Block on the different Data centre distance = 6
The client sends the read request to the name node .The name
node sends the data blocks containing the first few blocks
.Name node will returns the blocks starting with the closest
node to the farthest .The client will start to read the blocks one
by one.In case a data node fails client makes a note of it and
that node is not picked for later reads.
136
Fig 5.6 HDFS Read
HDFS Federation and High availability.
Federation is added to balance the load on the name node as the
cluster size increases.
A new name node can be added and the file tree structures and
the doc pool can be divided amongst the name nodes .Thus each
name node has to manage only the pool of blocks it is
associated with and not the complete pool .Data nodes can be
associated with multiple name nodes .Name nodes wont
communicate with each other and failure of one wont effect the
other.
137
Fig 5.7 HDFS Federation
High Availability
It is the time taken to come back to the stable state in case of the
name node failure .To address this the name node is always
running on stand-by .The primary name node and the secondary
name node shares the name space and the edit logs via highly
available NFS mount. In future Zoo Keeper would be used to
transition from the primary to the secondary one .Data nodes
must send reports to both the name nodes .The reserve node
fences the primary node when it takes over wherein the standby
node will revoke the shared access and disable the network port.
138
Map reduce
Split is a fixed chunk of data that will act as an input to the map
task .Blocks belong to the HDFS world and splits belong to the
Map reduce world. All map jobs run in parallel and produce
output. All the results are merged, shuffled and act as an input
to the reduce job .The whole job execution is controlled by Job
Tracker and Task Tracker. A name node in HDFS is an
analogues to job tracker in map reduce and data node in HDFS
is analogues to task tracker in map reduce .Task tracker runs the
map and reduce job on data nodes .Job Tracker schedule tasks
for the task trackers. Job Tracker and task tracker run as Java
Jobs and they are not the hardware .It is wiser to spend on the
hardware that runs task tracker .
139
Fig 5.8 Map Reduce Paradigm
The map jobs get their inputs which are local to the data nodes
.This is called as data locality else latency will be added to the
network .Hence optimal size of the block is equal to the split
size .Map task write the output on the local disk and not on
HDFS with replications since its only the intermediate result
.Job tracker cleans the map output only after the successful
completion of the reduce job .All the maps output is merged,
sorted and partitioned .Reducer wont get the data locally .It
would be fetched from the network .Reducers output is written
to HDFS for reliability .
140
Fig 5.10 Map Reduce Classes
Map Reduce Mechanism
Input to map is split and split will contain many records .The
map functions has input in the form of keys and values and
output in the form of keys and values as well .The key is unique
to every record. Map produces one or more key value pairs. The
input to map will have unique keys .The output may have non-
unique keys .This will be helpful since we will sort the values
based on key values and would like to make sense of values
with the same keys. One particular reducer must get all the
values for particular keys .This is managed by Hadoop and
programmer need not manage anything here .The input to
reducer has list of values associated to single key .The sequence
of values is not much important here .The output of the reducer
will be sorted since its receiving the inputs in the sorted manner.
The output of the reducer depends on the inputs to the map and
subsequently on the keys in the input to the map.
Taking an example of a word count program .Lets assume the
input to the map job is a String to err is human to forgive is
divine and the output from the reducer is
141
Divine 1
Err 1
Forgive 1
Human 1
Is 2
To 2
Brief algorithm for the same is given below:
The same can be achieved using parallel maps and parallel
reducers .For parallel reducers value for all the same keys
should go to the same reducer and the distribution should be
equal .The shuffle and sort step is very critical to any map
reduce solution. In Java , we will have the Map Class with the
map logic ,Reduce Class with the reduce logic and the Driver
Class which will control and decide the configuration and how
the job will read and write the data .The distribution of data and
the shuffle and sort steps is all taken care by the Hadoop itself .
142
Fig 5.11 Map Reduce functioning
Lets take an example of the code base for writing the map,
reduce functions.
importjava.io.IOException;
importjava.util.StringTokenizer;
The following are the import statements for key and value
datatypes.These are Hadoop datatypes.LongWritable is
something similar to Long in Java .Text is analogues to String
in Java .Intwritable is a datatype similar to Int in Java
143
importorg.apache.hadoop.io.IntWritable;
importorg.apache.hadoop.io.LongWritable;
importorg.apache.hadoop.io.Text;
importorg.apache.hadoop.mapreduce.Mapper;
publicclassWordCountMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
Every map class will extent the Mapper class and override the
map method .
LongWirtable and Text are input key and value data types
followed by the output key and value data types .
privatefinalstaticIntWritableone = newIntWritable(1);
private Text word = new Text();
We need to override map function providing the input key
,input value and context .Role of context is to catch the output
key and value pair .we tokenize the string into words and write
it into the context with word as key and one as value.
@Override
publicvoid map(LongWritablekey, Text value, Context
context)
throwsIOException, InterruptedException {
String line = value.toString();
StringTokenizeritr = newStringTokenizer(line);
144
while (itr.hasMoreTokens()) {
//just added the below line to convert everything to lower case
word.set(itr.nextToken().toLowerCase());
// the following check is that the word starts with an alphabet.
if(Character.isAlphabetic((word.toString().charAt(0)))){
context.write(word, one);
Lets now look at the reducer class .
importorg.apache.hadoop.io.IntWritable;
importorg.apache.hadoop.io.Text;
importorg.apache.hadoop.mapreduce.Reducer;
Every reduce class will have to extend Reducer Class and
override reduce () method.The parameters have input key and
value followed by the output key and value .The iterable values
here implies that the key will point to multiple values .Context
collect the output key and value pair through write method .The
processing logic will iterate over the values .The input key and
value of the reducer should match the output key and value of
the map function .
145
publicclassWordCountReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
publicvoid reduce(Text key, Iterable<IntWritable>values,
Context context)
throwsIOException, InterruptedException {
intsum = 0;
for (IntWritablevalue : values) {
sum += value.get();
context.write(key, newIntWritable(sum));
Finally the driver class is as below
importorg.apache.hadoop.fs.Path;
importorg.apache.hadoop.io.IntWritable;
importorg.apache.hadoop.io.Text;
importorg.apache.hadoop.mapreduce.Job;
importorg.apache.hadoop.mapreduce.lib.input.FileInputFormat
;
146
importorg.apache.hadoop.mapreduce.lib.output.FileOutputFor
mat;
publicclassWordCount {
publicstaticvoid main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: WordCount<input path><output
path>");
System.exit(-1);
Driver class set up the job so that Haddop can take up from that
point and execute as specifcied by the programmer .
The setJarByClass indentifes the driver class when it is
distributed across the cluster .The job name will be visible on
the UI .
Job job = newJob();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
we set the input and output file paths for the job through
command line arguments .
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
We set up the mapper class and the reducer class .
147
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
We finally set the out key and value parameters by setting the
setOutputKeyClass and setOutputValueClass.This is the output
key and value type of the reducer .
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
The below statement triggers the submission o the job to
Hadoop.The distribution over the cluster and the network input
output is taken care by Hadoop.
System.exit(job.waitForCompletion(true) ? 0 : 1);
Assuming the Hadoop set up is done, we can execute the
program using the below command
Hadoop jar <jar file name ><driver class name ><input
filename ><output directory >
The input is supplied via the input file name and the output
directory contains the output .
Lets looks at the combiner code in execution when we have
multiple maps implemented .The function of combiner is to
process the map outputs locally so that there are less outputs to
transfer to the reducer .The map output is thus said to be
compressed .It reduces the load on the network bandwidth.
Combiners are reducers happening locally on map machines.
148
The combiners extend the Reducer class .The logic is written in
the reduce method. It is applied when the nature of the problem
is commutative and associative .Combiners can run multiple
times on maps output .
Here the combiner class extends Configured class and
implements Tool Interface .
publicclassWordCountWithCombinerextends Configured
implements Tool{
@Override
publicint run(String[] args) throws Exception {
Configuration conf = getConf();
Job job = newJob(conf, "MyJob");
job.setJarByClass(WordCount.class);
job.setJobName("Word Count With Combiners");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordCountMapper.class);
This is the way we set up the Combiner.Here the reducer class
is re-used for combiner .This can be done when the function
performed is the same .
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
149
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
returnjob.waitForCompletion(true) ? 0 : 1;
Here the ToolRunner invokes the run function.It gives the
ability to set the properties at the run time
publicstaticvoid main(String[] args) throws Exception {
intexitCode = ToolRunner.run(new Configuration(),
newWordCountWithCombiner(), args);
System.exit(exitCode);
We can leverage the Tool Runner property by providing the
argument as
-D mapred .reduce .tasks =0
Thus while designing a solution ,we should always divide the
problem into 2 parts the map part and the reduce part .The
map will take inputs into splits .For each record map function
will be called and it will break the record into keys and values
.Values with the same key should help us reaching to our
problem statement objective .The input to reducer is in the form
of keys and list of values and the result is in the form of pair of
keys and values .The map job and reducer job may run on
different machines altogether .The whole transfer of data is
managed by Hadoop using Shuffle ,sort and partition steps.
150
Using combiner reduces the map output and hence increases the
performance efficiency .The output of combiner has different
partitions done by the partition function .Each partition goes to
the same reducer. These partitioning is done on the map
machines locally. At times we can use the combiner class as
reducer but it is not always true .The partitions are merged into
a file and sent to reducers.
Fig 5.12 Map Reduce Combiner
Hadoop Data Types
When map communicates with reduce , the data is transferred in
terms of objects .Serialization is the process of turning the
structured objects into byte stream for transmission over the
151
network .De-serialization is the reverse process where the byte
stream is converted back into structured objects .All the
communications between map and reduce happens via RPC or
remote procedure calls .
The features of serializations are compact, fast ,extensible and
interoperable .Java serialization has overheads when the data
was serialized and is not compact .Hadoop uses Writable
serialization framework where in the client already knows about
what to expect .We can also write a custom writable but in that
case methods such as write, readFields, compareTo, hashcode,
equals and toString should be overloaded.
Partition function
It takes key value pair as input and produces an integer result
.This integer is used to decide to which reducer the key value
pari would go to.
publicclassDefaultDriverextends Configured implements Tool
{
@Override
publicint run(String[] args) throws Exception {
Job job = newJob(getConf());
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
returnjob.waitForCompletion(true) ? 0 : 1;
}
152
publicstaticvoid main(String[] args) throws Exception {
intexitCode = ToolRunner.run(newDefaultDriver(), args);
System.exit(exitCode);
In case we use a default class ,the default practitioner is used
which is a hash partioner .In such a above case the default
mapper ,reducer and practitioner will be used .
The above default drivers class run function interprets to the
following
publicint run(String[] args) throws Exception {
Job job = newJob(getConf());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(Mapper.class);
153
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setPartitionerClass(HashPartitioner.class);
job.setNumReduceTasks(1);
job.setReducerClass(Reducer.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
returnjob.waitForCompletion(true) ? 0 : 1;
When we submit the job using waitForCompletion function
which is the last statement in the driver class (using the tool
runner ),it causes the job to be submitted for processing .the
property used by the job is the mapred.job.tracker which is
present in mapred-site.xml.In the pseudo distribution mode the
job tracker and task tracker emulate to run on different JVMs.In
.23 release the map reduce is built on a new framework called
as YARN and the framework of execution is decided by the
property mapreduce.framework.name .YARN stands for Yet
Another Resource Negotiator .It can be set to local , classic or
YARN .The job client in the client node interacts with the Job
Tracker on the job Tracker Node .The Job Tracker interacts
with Task Tracker on many different nodes .The client submits
the job to the job tracker .Various checks are conducted on the
Job Tracker e.g. is the input file present .On the task tracker
154
runs the map and reducer jobs . Task tracker sends reports to
the Job Tracker periodically.
Fig 5.13 Job Tracker and Task Tracker
YARN
YARN means Yet Another Resource Negotiator.It is also
known as Map-Reduce 2 .In the classic Map Reduce , when
the cluster size increased 4000+ nodes , the scalability got
saturated due to the load on the Job Tracker .The project started
by Yahoo aimed to increase the performance by smarter
memory utilization and enhanced scalability so that it could run
many versions of distributed framework in parallel on the same
cluster .Job Tracker responsibility got bifurcated into 2 portions
resource Manager (Job scheduling )and Application Master
(Task Monitoring ).With YARN only the way of execution got
changed .There is better memory utilization with the
introduction of concept of containers .YARN has the following
components :
155
Client
YARN Resource Manager (Scheduler and Application Master)
YARN Node Manager
Map Reduce Application Master
YARN Client
Distributed File System
The Job get submitted to Job client .Job Client requests for a
new application ID. It checks if the output directory is already
created .If it is it will throw and error and stop there itself .It
verifies the input directory and copies the resources to HDFS
with a high replication .It finally submits the application to
resource Manager .As soon as the resource Manager picks up
the new job , it will contact a Node Manager to start a new
container and launch a new Application master for the job
.Application Master creates an object for book keeping
purposes and task management purposes .It retrieves the split
from HDFS and creates one task per split .Then Application
master decides how to run the map reduce task .If the task is
small , it may be run on the same JVN itself .These small tasks
are known as Uber Tasks .If the task is not Uber , it requests the
resource manager to allocate the resources .Scheduler considers
data locality while assignment .In case its not able to find a
node which is Rack Local , it allocates any node randomly
.Application master then contacts the node manager to launch a
container for task executions .Then the YARN child is launched
.it runs on a separate JVM.Then YARN child retrieves all the
resources from HDFS , localizes them and run the Map Reduce
Task .YARN child sends the progress report every 3 seconds to
Application Master which aggregates the progress and updates
156
the client directly .In the completion phase , application master
and task Container clean up the intermediate data and
terminates itself on job completion .There can be some failure
scenarios namely
1. Task Failure For code related issues, application master
waits for time given by mapred.task.timeout .There can be
runtime errors or JVM errors .There will be another attempts
given by mapred.map.maxattempts and
mapred.reduce.maxattempts. Finally the complete job will
be marked as failed .Resource Manager stops getting the
heart beats .
2. Application Master Failure The task under need not be
resubmitted. It can be recovered in case
yarn.app.mapreduce.am.job.recovery.enable is set. Resource
Manager stops getting the heart beats .Resource manager
starts the application master on the new container.
3. Node Manager Failure Node manager stops sending the
heartbeats to resource manager .Resource Manager waits for
some time .All the remaining tasks are spawned under the
new node manager .Frequent failure blacklists the nodes.
4. Resource Manager Failure This is the most serious failure
.A check point mechanism is put in place which is an
improvement from the classic map reduce .A new resource
manager is brought up by the administrator.
157
Fig 5.14 YARN functioning
Pig
Pig is developed at Yahoo at the same time when facebook
worked on Hive .Thus there is an overlap in capabilities .The
idea was to help data scientists write map redude programs
quickly and easily .Pig is ideally suited for the complex
operations like the join operations .It is not suited if looking at a
smaller portion of data .Under the hood , Pig runs a series of
Map Reduce Programs .Pig should be used in complex join
operations .However for faster execution stick to map reduce
since the solutions written in Map Reduce are highly optimized
.Pig written programs are parsed and converted to Map Reduce
Programs .Pig solutions are no well optimized and hence will be
slower than the map reduce solutions till some time in future
.Pig has 2 components
1. Pig Latin which is the programming language It can be
coded as scripts or a Grunt mode (Interactive mode ) or
158
Embedded mode (Pig commands can be embedded into a
Java Program)
2. Environment to run the Pig Programs Its a tar file that
needs to be installed at the client node which translates the
Pig queries into Map Reduce jobs .The environment can
have the following 2 types of execution a. Local Mode
Execution (runs on a single JVM) b. Map Reduce Mode
(Translate to map reduce program and connects to Hadoop
and runs it on the Hadoop cluster)
Fig 5.15 Pig Attributes
Pig is a data flow language and assignment of data sets happens
to a variable .The following are some of the sample PIG
commands (with data set on the right ):
159
Hive
Hive was developed at facebook with a similar reason to that of
PIG.It was developed for data scientists with weak Java skills to
help them work on Hadoop.Hive is a SQL like language .It is
intended to perform data operations like slicing and dicing and
not advanced logical operations for which Map reduce is still
the best fit .In Hive , data should comply with the schema at the
time of read and not at the time of write .In HIVE command
delimitation plays an important role in the data read and store
operations .The rules and information about the table are stored
in the metastore at the time of creation of the table .Below are
few of the important HIVE Commands
There are 2 ways a table can be specified in HIVE - Managed
Table and External Table. A Managed table needs to be
managed by HIVE .It is the default way of creating a table
.When we perform a load command , the input data file will be
moved from an original location to a new location in HIVE
warehouse where it manages the file completely .A delete will
delete the file from the metastore and the ware house .In case of
an external table, during the load HIVE performs a link to the
dataset and doesnt even check in case the data is present .It
makes a related data entry in the Meta Store .This process of
late binding of schema is termed as Lazy in Hive .In case of
160
drop command for extended tables , the entry gets deleted from
the metastore and the data still resides there .The operations
discussed are shown below :
Data in Hive can be divided into partitions or buckets .The
syntax for the same is shown below:
There are 2 formats in which data is stored in Hive Row
format and file format .Row format deals with how the data
fields are stored in the Hive Table .-how the fileds , rows ,keys
and values will be delimited .Terminology used to describe the
Row Format is SerDe.File Format can be Sequence Files or RC
161
files .Row oriented layout is called as sequence files and
column oriented layout is termed as the RC files .RC files are
used when a portion of columns need to read repeatedly and
others need to be discarded .
Sqoop
Sqoop is a tool developed by Apache to ingest the data into
Hadoop and export data out of Hadoop.It is a tool designed for
efficiently transferring the bulk data between Hadoop and
structured data sources .It uses the power of parallelism by
effectively using the map reduce operations .The data sources
are generally relational databases .The target destination could
be HDFS Or Hbase .The format could be delimited text
,AVRO or Sequence Files .s
Fig 5.16 Import/Export using Sqoop
162
Following are the import statements and options in Sqoop.
163
CHAPTER 6 - CRYPTO-ALGORITHMS IN JAVA
Getting Strarted
Here we will study the Java cryptography architecture for
secure hashing , files hashing and password hashing. This can
be utilized in java android applications and desktop
applications. The various types of hashing described are simple
hashing , Stream based hashing ,Message Authentication Code
And Secure Password hashing using
PBKDF2/PKCS#5.(following the security scheme PBKDF2
from the security standard PKCS#5)
Crypto Algorithms
Java 1.8 includes all state of the art crypto algorithms .We used
apache codec which includes hash encoding base 64 which
most java platforms do not include .Java cryptography
framework uses the factory pattern in the form of message
digest for creating hashes.
Simple Hash
We are getting an instance of message disest (hash function)
we need to specify the algorithm a string code MD5
.Message digest works on the basis of bytes .Following is the
snippet for simple hash.
package javacryptography;
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
164
import java.util.logging.Level;
import java.util.logging.Logger;
/**
*
* @author deepakmali
*/
public class SimpleHash {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
// TODO code application logic here
String str = "Test String";
try {
MessageDigest md =MessageDigest.getInstance("MD5");
byte [] hash = md.digest(str.getBytes("UTF-8"));
System.out.println(hash.length);
for (byte b : hash)
System.out.print(b + " ,");
} catch (NoSuchAlgorithmException ex) {
Logger.getLogger(SimpleHash.class.getName()).log(Level.SE
VERE, null, ex);
165
} catch (UnsupportedEncodingException ex) {
Logger.getLogger(SimpleHash.class.getName()).log(Level.SE
VERE, null, ex);
Input and output both are in bytes format for the hash function
and commons codec will display the result of the hash function
.Digest method produces a direct output.The method
getByptes() is very error prone since it depends on the default
encoding of the JVM running.Its good to mention the encoding
we want to use with getBytes() method- in our case its UTF -8 .
The console output is a MD5 hash of the length of 16 - -67 ,8 ,-
70 ,60 ,-104 ,46 ,-86 ,-41 ,104 ,96 ,37 ,54 ,-5 ,-114 ,17 ,-124
Lets now write a code to verify the MD5 checksum for the
apache-codec file downloaded from apache website .We will
create the hash value for the file .We will create the helper
method to read the bytes from the file .It will read a file or any
other input stream and return a byte array. Following is the
snippet.
package javacryptography;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
166
import java.security.NoSuchAlgorithmException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.codec.binary.Hex;
/**
* @author deepakmali
*/
public class FileHash {
public static void main(String[] args) throws IOException {
// TODO code application logic here
String path =
"C:\\Users\\deepakmali\\Desktop\\Aadhar\\commons-codec-
1.10-bin.zip";
byte [] bytes = Helper.read(new FileInputStream(path));
try {
MessageDigest md =MessageDigest.getInstance("MD5");
byte [] hash = md.digest(bytes);
System.out.println(hash.length);
for (byte b : hash)
System.out.print(b + " ,");
167
String hash_encoded = Hex.encodeHexString(hash);
System.out.println (hash_encoded);
} catch (NoSuchAlgorithmException ex) {
Logger.getLogger(SimpleHash.class.getName()).log(Level.SE
VERE, null, ex);
package javacryptography;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
/**
* @author deepakmali
*/
public class Helper {
public static byte [] read (InputStream is) throws IOException{
ByteArrayOutputStream bos =new ByteArrayOutputStream();
byte [] buffer = new byte [1024 * 16];
168
int length ;
while (true)
length= is.read(buffer);
if (length <0)
break;
bos.write(buffer,0,length);
is.close();
return bos.toByteArray();
File Stream Hash and MAC
We want to work stream based in case the file is huge .We will
now work on the file stream hash We make use of an Index
based method to update the digest . In cryptography ,
authentication codes are abbreviated as MAC .We should not
confuse them with the Ethernet address .Mac can also be
realized using hash functions as a basis. In SSL ,a HMAC is
used to build a MAC and is used to hash a secret document .
Secret can be arbitrary byte vector or password based .We need
to adhere a secret to the hash using an init method .We will still
get a 16 bytes output but its now a password based hash (a
MAC)We can use other hash functions and also once the
169
password is change the result is going to change .For MAC , we
can use the SHA256 algorithm .Crypto-specialists have created
the algorithm which is very mature and easy to use .The
following are the snippet for File Stream Hash and MAC.
import java.io.FileInputStream;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.commons.codec.binary.Hex;
/**
* @author deepakmali
*/
public class FileStreamHash {
public static void main(String[] args) throws IOException {
// TODO code application logic here
String path =
"C:\\Users\\deepakmali\\Desktop\\Aadhar\\commons-codec-
1.10-bin.zip";
170
try {
MessageDigest md =MessageDigest.getInstance("MD5");
byte [] buffer = new byte [1024 * 16];
int length ;
FileInputStream fis = new FileInputStream(path);
while (true)
length= fis.read(buffer);
if (length <0)
break;
md.update(buffer,0,length);
fis.close();
byte[] hash = md.digest();
for (byte b : hash)
System.out.print(b + " ,");
System.out.println(hash.length);
String hash_encoded = Hex.encodeHexString(hash);
System.out.println (hash_encoded);
} catch (NoSuchAlgorithmException ex) {
171
Logger.getLogger(SimpleHash.class.getName()).log(Level.SE
VERE, null, ex);
}
}
}
package javacryptography;
import java.io.UnsupportedEncodingException;
import java.security.InvalidKeyException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Hex;
/**
*
* @author deepakmali
*/
public class SimpleMac {
public static void main(String[] args) {
String str = "Test String";
try {
172
// Mac mac = Mac.getInstance("HMACMD5");
Mac mac = Mac.getInstance("HMACSHA256");
mac.init(new SecretKeySpec("Password ".getBytes("UTF-
8"),""));
byte [] hash = mac.doFinal(str.getBytes("UTF-8"));
System.out.println(hash.length);
for (byte b : hash)
System.out.print(b + " ,");
String hash_encoded = Hex.encodeHexString(hash);
System.out.println (hash_encoded);
} catch (NoSuchAlgorithmException ex) {
Logger.getLogger(SimpleHash.class.getName()).log(Level.SE
VERE, null, ex);
} catch (UnsupportedEncodingException ex) {
Logger.getLogger(SimpleHash.class.getName()).log(Level.SE
VERE, null, ex);
} catch (InvalidKeyException ex) {
Logger.getLogger(SimpleMac.class.getName()).log(Level.SEV
ERE, null, ex);
}
}
}
173
PBKDF2
The problem with these hash functions is the lengths of the
password is short and brute force attacks are possible. Crypto
specialist created special hashing schemes for password
hashing. Secure password hashing include salting (Addition of
data to the password before the hash is generated) and iterations
(re-hashing multiple times).Secure schemes for hashing the
password is PBKDF2 from PKCS5.A very popular
implementation of PBKDF2 is done by the open source crypto
provider Bouncy Castle. It is so robust that even if the attacker
gets the hash , he will not be able to restore the password .An
implementation based on bouncy castle is shown below
.(Bouncy Castle implementation is a mix of salting and
iterations )It is HMAC based and follows the salting and
iteration principles.
package pbkdf2_test;
/**
*
* @author deepakmali
*/
import java.util.Arrays;
/**
* This class demonstrates the usage of PBKDF2 secure
password hash creation
* and allows to play around with the input parameters (***)
* for better understanding.
*
174
* For production usage, please note the "TODO" tags!
*
* Works for Java and Android.
*
*/
public class PBKDF2Example {
public static void main(String[] args) throws Exception{
String demo_password = "Password";
//TODO: Has to be
//1)generated by secure random
//2) individual per case
//3) with appropriate length
byte[] demo_salt = {
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,
//1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,1
6, //***
};
int demo_iterationcount = 10000;
//demo_iterationcount++; //***
int demo_size = 256; //size in bits
175
//demo_size+=8; //***
//TODO: Salt + Iteration count
+ Size have to be stored with the hash to recalculate the hash
later
// -> Iteration count + Size
can be constant values, the salt has to be stored individually
//--------------
//---CREATION---
//--------------
PBKDF2 hashgen = new PBKDF2();
hashgen.init(demo_password.getBytes
("UTF8"), demo_salt, demo_iterationcount, demo_size);
byte[] rawhash =
hashgen.generateDerivedParameters();
System.out.println("LEN: " + rawhash.length * 8);
for(byte b: rawhash)
System.out.print(b + ", ");
byte[] stored_hash = rawhash;
//TODO: to be persisted
//------------------
//---VERIFICATION---
//------------------
System.out.println("\n\n===Second
attempt===\n");
//demo_password = "Wrong"; //***
176
hashgen = new PBKDF2();
hashgen.init(demo_password.getBytes
("UTF8"), demo_salt, demo_iterationcount, demo_size);
rawhash = hashgen.generateDerivedParameters();
System.out.println("LEN: " + rawhash.length * 8);
for(byte b: rawhash)
System.out.print(b + ", ");
System.out.println("\nVerification: "
+ Arrays.equals(stored_hash, rawhash));
}
}
package pbkdf2_test;
/**
*
* @author deepakmali
*/
/**
* This class is based on the Bouncy Castle Java class
*
org.bouncycastle.crypto.generators.PKCS5S2ParametersGenera
tor
* It uses the SHA512 HMAC as default. Only JCE/JCA classes
are being used.
177
* HmacSHA512 is included in JRE 1.8++. This class also runs
below 1.8.
* Then, BC or others have to be used to include the
HmacSHA512 algorithm
* or a different HMAC has to be used.
* The Bouncy Castle License
* Copyright (c) 2000-2015 The Legion Of The Bouncy Castle
Inc.
* (http://www.bouncycastle.org)
* <p>
* Permission is hereby granted, free of charge, to any person
obtaining a copy
* of this software and associated documentation files (the
"Software"), to deal
* in the Software without restriction, including without
limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell
178
* copies of the Software, and to permit persons to whom the
Software is
* furnished to do so, subject to the following conditions:
* <p>
* The above copyright notice and this permission notice shall
be included in
* all copies or substantial portions of the Software.
* <p>
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT
WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR
ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE
OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
179
*
* Generator for PBE derived keys and ivs as defined by PKCS
5 V2.0 Scheme 2.
* <p>
* The document this implementation is based on can be found
at <a
* href=http://www.rsasecurity.com/rsalabs/pkcs/pkcs-
5/index.html> RSA's PKCS5
* Page</a>
*/
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
public class PBKDF2 {
private byte[] password;
private byte[] salt;
private int iterationCount;
private int keysize;
private Mac hMac;
private byte[] state;
public PBKDF2() {
try {
180
hMac = Mac.getInstance("HmacSHA512");
} catch (Exception e) {
e.printStackTrace();
state = new
byte[hMac.getMacLength()];
private void F(byte[] S, int c, byte[]
iBuf, byte[] out, int outOff) throws Exception{
if (c <= 0) {
throw new
IllegalArgumentException("iteration count must be at least 1.");
if (S != null) {
hMac.update(S, 0, S.length);
hMac.update(iBuf, 0, iBuf.length);
hMac.doFinal(state, 0);
System.arraycopy(state, 0, out, outOff, state.length);
for (int count = 1; count < c; count++) {
181
hMac.update(state, 0, state.length);
hMac.doFinal(state, 0);
for (int j = 0; j != state.length; j++) {
out[outOff + j] ^= state[j];
private byte[] generateDerivedKey(int
dkLen) throws Exception{
int hLen = hMac.getMacLength();
int l = (dkLen + hLen - 1) / hLen;
byte[] iBuf = new byte[4];
byte[] outBytes = new byte[l * hLen];
int outPos = 0;
hMac.init(new SecretKeySpec(password, ""));
for (int i = 1; i <= l; i++) {
// Increment the value in 'iBuf'
int pos = 3;
while (++iBuf[pos] == 0) {
--pos;
182
}
F(salt, iterationCount, iBuf, outBytes, outPos);
outPos += hLen;
return outBytes;
public void init(byte[] password,
byte[] salt, int iterationCount, int keysize) {
keysize = keysize / 8;
this.password = password;
this.salt = salt;
this.iterationCount = iterationCount;
this.keysize = keysize;
public byte[] generateDerivedParameters() throws Exception {
byte[] dKey =
generateDerivedKey(keysize);
if(dKey.length < keysize)
throw new IllegalStateException("General failure: Generated
array not in keysize!");
183
if (dKey.length > keysize){
byte[] dk2 = new byte[keysize];
System.arraycopy(dKey, 0, dk2, 0, keysize);
dKey = dk2;
return dKey;
184
CHAPTER 7 MODULARITY IN JAVA
Getting Started
A large number of massive Java applications are live in
production are bundled with many jar files. Lack of structured
or modularize approach is a maintenance nightmare for any
application. With Java 9, we have some good news here
.Modularity features in Java 9 has come a long way making the
process of modularization easier and accessible. Java 9 has
taken a bottom up approach imbibing modularity into its core
libraries, language and JVM. Modularization involves a large
scale change to an application with an enormous impact on the
design, compilation, packaging, etc. Its more fundamental to
approach than merely a new feature.
Meaning of Modularity
Modularity means decomposing an application into self-
contained units ,and each unit encompasses code , metadata
about the module and metadata describing its relation with other
modules within the application .Each such unit could be an
business function (e.g. a Product catalogue in an e commerce
application )or a technical unit (e.g. database access logic
).Modularity demands strong encapsulation to prevent other
modules effecting the internals and establish communication via
the public assessors.In such a case the internals can be updated
without any regression effects .Each module should be exposed
via well-defined interfaces since this is the single point of
communication between two modules and any change at the
interface level could potentially break the dependent modules.
Also, to ensure that the metadata pointing to the dependencies is
well defined, we should be very clear with all the modules
dependencies (at times referred as module graph)
185
Pre-Java9 Era
Packages and access modifiers (public, private,protected) have
provided some form of encapsulation and interfaces have
offered modularization in pre-Java 9 era. The approach of
hiding the details behind the public interfaces have been long
seen and acted as a basis for designing modular codebases. For
using classes from other packages bundled within an external
jar, we use import statements .Not writing proper imports will
be a compile time catch , however to know which dependent
jars need to be available at runtime when our jar refers them ,
we restore to tools like Maven and OGSI.
Jars are a close approximation of modules in the pre- Java 9
world .In case our Test Application uses hibernate validator jar
which brings further jars through transitive dependency
resolution.
Fig 7.1 Modularity pre Java 9
Since the classes bundled with jars have public APIs or
interfaces exposed .In case these public interfaces change, our
code will break .The situation describes lack of true
encapsulation. When we execute a java program explicitly
providing the external dependencies with the classpath
argument, the JVM loads the classes sequentially .In case any
class is missed or two different versions of the same jar exists, a
runtime exception is bound to happen. A safe bet is to use Java
9 modules.
186
Java 9 modules
With java 9, the JDK itself is modularized and the applications
are built using a modular system.Modules arestrongly
encapsulated entitiesand they express the dependency on other
modules explicitly in the modular system. Extending it to the
above example, the hibernate validator will mark classmate jar,
jboss-logmanager jar and validation api jar as part of its module
descriptor. Each module will have a publicly accessible part and
an encapsulated part .The application can also refer to a jar or
module which is part of the JDK e.g. java.sql jar although the
application will have default access to platform module
(java.base )containing packages like java.lang and java.util .The
modular system offer many advantages :
Explicit dependencies make the encapsulation stronger.Any
accidental references to the internal /transitive dependencies
are prevented.
The dependencies are resolved at compile time as well as
runtime
Well defined public interfaces enable teams to work in
parallel
Minimal configuration for modules
Modularizing JDK was a very challenging task .There were
many jars bundled with rt.jar (jar all the java applications
use)which may or may be used by all the applications .Thus in
the cloud solutions deployed over small containers we often
face resource constraints. At the same time , we cannot remove
these jars blindly in order to support backward compatibility. In
the modular Java , application developers can choose to ignore
the modules not in use. The task to bring down the monolithic
JDK is also challenging because apart from obsolete
technologies (COBRA jar) there is an aspect of usability .A web
187
application may not use any AWT or Swing utility but a thick
client application may .Using only the requires platform
modules will reduce the security risks as well.The first step
towards a more modular JDK was taken in Java 8 with the
introduction of compact profiles. A profile defines a subset of
packages from the standard library available to applications
targeting that profile. Three profiles are defined, imaginatively
called compact1, compact2 and compact3. Each of these
profiles is a superset of the previous, adding more packages that
can be used. The Java compiler and runtime were updated with
knowledge of these pre-defined profiles. Java SE Embedded 8
offers low footprint runtimes matching the compact profiles.
The JDK now consists of about 90 platform modules, instead of
a monolithic library. Every module constitutes a well-defined
piece of functionality of the JDK, ranging from logging to XML
support. All modules explicitly define their dependencies on
other modules.For example, java.logging has many incoming
dependencies, meaning it used by many other platform
modules. That makes sense for a central functionality such as
logging. Module java.xml.ws (containing the JAX-WS APIs for
SOAP WebServices) has many outgoing dependencies,
including an unexpected one on java.desktop. Aggregator
modules roughly corresponding to the previously discussed
compact profiles have been defined as
well: java.compact1, java.compact2 and java.compact3.
Decomposing the JDK into modules has been a tremendous
amount of work. Splitting up an entangled, organically grown
codebase containing tens of thousands of classes into well-
defined modules with clear boundaries, while retaining
backwards compatibility, takes time. This is one of the reasons
it took a long time to get a module system into Java. With over
20 years of legacy accumulated, many dubious dependencies
188
had to be untangled. Going forward, this effort will definitely
pay off in terms of development speed and increased flexibility
for the JDK.
Module Descriptors
Now that we have a high level overview of the JDK module
structure, lets explore how it works. What is a module, and
how is it defined? A module has a name, it groups related code
and possibly other resources, and is described by a module
descriptor. The module descriptor lives in a file called module-
info.java. Heres an example of the module descriptor for
the java.prefs platform module:
module java.prefs {
requires java.xml;
exports java.util.prefs;
1. The requires keywords indicates a dependency, in this case
on java.xml.
2. A single package from the java.prefs module is exported to
other modules.
Modules live in a global namespace, therefore module names
must be unique. As with package names, you can use
conventions like reverse DNS notation
(e.g. com.mycompany.project.somemodule) to ensure
uniqueness for your own modules. A module descriptor always
starts with the module keyword, followed by the name of the
module. Then, the body of module-info.java describes other
characteristics of the module, if any.
189
Lets move on to the body of the module descriptor
for java.prefs. Code in java.prefs uses code from java.xml to
load preferences from XML files. This dependency must be
expressed in the module descriptor. Without this dependency
declaration, the java.prefs module would not compile (or run).
Such a dependency is declared with the requires keyword
followed by a module name, in this case java.xml. The implicit
dependency on java.base may be added to a module descriptor.
Doing so adds no value, similar to how you can (but generally
dont) add "import java.lang.String" to code using Strings.
A module descriptor can also contain exports statements. Strong
encapsulation is the default for modules. Only when a package
is explicitly exported, like java.util.prefs in this example, can it
be accessed from other modules. Packages inside a module that
are not exported, are inaccessible from other modules by
default. Other modules cannot refer to types in encapsulated
packages, even if they have dependency on the module.
Readability
A module lists the other modules it depends on to compile and
run. Again from the State of the Module System This concept
of readability is the basis of reliable configuration: When any
condition is violated, the module system refuses to compile or
launch the code; an immense improvement over the brittle
classpath model.
Accessibility
A module lists the packages it exports. A type in one module is
only accessible by code in another module if:
the type is public
190
the containing package is exported by the first module
the second module reads the first
This means that public is no longer really public. A public type
in a non-exported package is as hidden from the outside world
as a non-public type in an exported package. So "public is
even more hidden than package-private types are today, because
the module system does not evenpermit access via reflection.
As Jigsaw is currently implemented, command line flags are the
only way around this.
So accessibility builds on top of readability and the export
clauses, creating the basis for strong encapsulation, where a
module author can clearly express which parts of the modules
API are public and supported.
Creating the first module
Say we have an application that monitors microservices running
in our network. It periodically contacts them and uses their
responses to update a database table as well as a nifty JavaFX
UI. For the time being, lets assume that the application is
developed as a single project without any dependencies.
Now lets switch to Java 9 with Jigsaw! (The early access
builds are availale on java.net - the code samples and
commands presented in this article were created for build 96
from December 22nd.) While jdeps, the Java dependency
analyzer, already exists in JDK 8, we need to use the JDK 9
version so that it knows about modules.
The first thing to note is that we can simply choose to ignore
modules. Unless the code is relying on internal APIs or some
other JDK implementation details (in which case things will
191
break), the application can be compiled and executed exactly as
with Java 8. Simply add the artifact (and its dependencies if it
has any) to the classpath and call main. And voil, it runs!
To move the code into a module we must create a module
descriptor for it. This is a source code file called module-
info.java in the source directorys root.
module com.infoq.monitor {
// add modules our application requires
// add packages our application exports
Our application is now a module with the name
com.infoq.monitor. To determine which modules it depends on,
we can use jdeps:
jdeps-module ServiceMonitor.jar
This will list the packages our application uses and, more
importantly, from which modules they
come: java.base, java.logging, java.sql, javafx.base,
javafx.controls, javafx.graphics.
With our dependencies covered we can now think about which
packages we might export. Since we are talking about a stand-
alone application, we wont export anything.
module com.infoq.monitor {
requires java.base;// see more about that below
192
requires java.logging;
requires java.sql;
requires javafx.base;
requires javafx.controls;
requires javafx.graphics;
// no packages to export
Compiling is the same as without Jigsaw except we have to
include module-info.java in the list of source files:
javac-d classes/com.infoq.monitor ${source files}
Now that all classes were compiled
into classes/com.infoq.monitor, we can create a JAR from them:
jar -c \
--file=mods/com.infoq.monitor.jar \
--main-class=com.infoq.monitor.Monitor \
${compiled class files}
The new --main-class can be used to specify the class
containing the applications entry point. The result is a so-called
modular JAR, which we will discuss further below.
193
In stark contrast to the old model, there is a whole new
sequence to launch the application. We use the new -mp switch
to specify where to look for modules, and -m to name the one
we want to launch:
java -mp mods -m com.infoq.monitor
Kinds Of Modules
The JDK itself was modularized and consists of about
80 platform modules (have a look at them with java -listmods).
The ones that will be standardized for Java 9 SE are prefixed
with java., the JDK-specific ones with jdk..
Not all Java environments have to contain all platform modules.
On the contrary, one goal of Jigsaw is the scalable platform,
which means that it is easy to create a runtime with just the
desired modules.
All Java code depends on Object and virtually all code uses
basic features like threading and collections. These types can be
found in java.base, which plays a special role; it is the only
module that the module system inherently knows about, and
since all code depends on it, all modules automatically read it.
So in our example above, we would not need to declare a
dependency on java.base.
Non-platform modules are called application modules and the
module used to launch an application (the one containing main)
is the initial module.
194
Module Jars
As we have seen above, Jigsaw still creates JARs, albeit with
the new semantics. If they contain a module-info.class they are
called modular JARs, about which State of the Module
System says:
A modular JAR file is like an ordinary JAR file in all possible
ways, except that it also includes a module-info.class file in its
root directory.
Modular JARs can be put on the classpath as well as on the
module path (see below). This allows projects to release a
single artifact and let its users decide between the legacy or the
modular approach to application building.
Module Path
Platform modules are part of the currently used environment
and hence are easily available. To let the compiler or VM know
about application modules, we have to specify a module
path with -mp (as we have done above).
The platform modules in the current environment and the
application modules from the module path together comprise
the universe of observable modules. Given that set and an initial
module contained therein the virtual machine can create a
module graph.
Module Graph
Starting from the initial application module, the module system
resolves all of its transitive dependencies. The result is the
module graph where modules are nodes and a dependency of
one module on another is a directed edge.
195
For our example, that looks like this:
Fig 7.2 Module Graph - I
The platform modules are shown in blue, where the brighter
modules are the ones we depend on directly and the darker ones
transitive dependencies. The ubiquitous java.base is not shown;
remember, all modules implicitly depend on it.
196
Splitting Modules
With this understanding of how modules are handled by the
compiler and virtual machine, lets start to think about how to
divide our application into modules.
Our architecture consists of the following parts:
contacting the microservices and creating statistics
updating the database
presenting the JavaFX user interface
wiring the pieces
Lets go ahead and create a module for each:
com.infoq.monitor.stats
com.infoq.monitor.db
com.infoq.monitor.ui
com.infoq.monitor
The Jigsaw Quick-Start Guide and the JDK itself propose to
have a folder for each module below the projects root source
folder. In our case this would have the following structure:
Service Monitor
src
com.infoq.monitor
com ...
197
module-info.java
com.infoq.monitor.db
com ...
module-info.java
com.infoq.monitor.stats
com ...
module-info.java
com.infoq.monitor.ui
com ...
module-info.java
This tree is shown truncated here, but each com directory
represents the package of the same name and would contain
more subdirectories and eventually the modules code.
The statistics module depends on java.base (but as we learned,
we dont have to list that) and uses Javas built-in logging
facilities. Its public API allows to request aggregated and
detailed statistics and is contained in a single package.
module com.infoq.monitor.stats {
requires java.logging;
exports com.infoq.monitor.stats.get;
198
To reiterate the accessibility rules: non-public types in
com.infoq.monitor.stats.get and all types in other packages are
completely hidden from all other modules. Even the exported
package is only visible to modules that read this one.
Our database module also logs and obviously requires Javas
SQL features. Its API consists of a simple writer.
module com.infoq.monitor.db {
requires java.logging;
requires java.sql;
exports com.infoq.monitor.db.write;
}
The user interface obviously requires the platform modules that
contain the employed JavaFX features. Its API consists of the
means to launch the JavaFX UI. This will return a model that
uses JavaFX properties. The client can update the model, and
the UI will show the new state.
module com.infoq.monitor.ui {
requires javafx.base;
requires javafx.controls;
requires javafx.graphics;
exports com.infoq.monitor.ui.launch;
exports com.infoq.monitor.ui.show;
199
Now that we covered the actual functionality, we can turn our
attention to the main module, which wires all of the parts
together. This module requires each of our three modules, plus
Javas logging facilities. Since the UI requires its
dependent module to work with JavaFX properties, it depends
on javafx.base as well. And since the main module is not used
by anybody else, it has no API to export.
module com.infoq.monitor {
requires com.infoq.monitor.stats;
requires com.infoq.monitor.db;
requires com.infoq.monitor.ui;
requires javafx.base;// to update the UI model
requires java.logging;
// no packages to export
Its a little unwieldy that we have to explicitly require
javafx.base, but if we dont, we couldnt call any code from
javafx.beans.property, which we must do to work with the
model. So in a way com.infoq.monitor.ui is broken without
javafx.beans.property. For this case the Jigsaw prototype
provides the concept of implied readability, which we will
cover shortly.
200
Our modularization creates a very different module graph:
Fig 7.3 Module Graph-II
Now let's take a look at some of the commands used to compile,
package, and launch our freshly modularized application. This
is how we can compile and package modules that have no
dependencies outside of the JDK itself:
javac -d classes/com.infoq.monitor.stats ${source files}
jar -c \
--file=mods/com.infoq.monitor.stats.jar \
${compiled class files}
201
Like before and exactly as with Java 8, we compile the
modules source, write the resulting files into a subfolder of
classes with the modules name, and create a JAR in mods. Its
a modular JAR because the class files include the compiled
module descriptor module-info.class.
More interesting is how we process our only module with
dependencies on other application modules:
javac \
-mp mods \
-d classes/com.infoq.monitor \
${list of source files}
jar -c \
--file=mods/com.infoq.monitor.jar \
--main-class=com.infoq.monitor.Monitor \
${compiled class files}
java -mp mods -m com.infoq.monitor
The compiler needs the application modules we created earlier,
and we point it there by specifying the module path with -
mp mods. Packaging and launching is as before but
now modscontains not just a single module but four of them.
202
The JVM will start by looking for module com.infoq.monitor
because we specified that as our initial module. When it finds
that, it will try to transitively resolve all of its dependencies
inside the universe of observable modules (in this case, our four
application modules and all platform modules).
If it can build a valid module graph, it will finally look up the
main method specified with --main-class=com.infoq.monitor.
Monitor and launch our application. Otherwise it will fail with
an exception informing us of the violated condition, e.g. a
missing module or a cyclic dependency.
Implied Readability
A modules dependency on another module can take two forms.
First, there are dependencies that are consumed internally
without the outside world having any knowledge of them. Take
for example Guava, where the code depending on a module
does not care at all whether it internally uses immutable lists or
not.
This is the most common case and covered by readability as
described above, where a module can only access another
modules API if it declares its dependency on it. So if a module
depends on Guava, other modules are left in blissful ignorance
of that fact and would not have access to Guava without
declaring their own explicit dependencies on it.
But there is another use case where the dependency is not
entirely encapsulated, but lives on the boundary between
modules. In that scenario one module depends on another, and
exposes types from the depended-upon module in its own
public API. In the example of Guava a modules exposed
methods might expect or return immutable lists.
203
So code that wants to call the dependent module might have to
use types from the depended-upon module. But it cant do that
if it does not also read the second module. Hence for the
dependent module to be usable, client modules would all have
to explicitly depend on that second module as well. Identifying
and manually resolving such hidden dependencies would be a
tedious and error-prone task.
This is where implied readability comes in:
[We] extend module declarations so that one module can grant
readability to additional modules, upon which it depends, to
any module that depends upon it. Such implied readability is
expressed by including the public modifier in a requires clause.
In the example of a modules public API using immutable lists,
the module would grant readability to Guava to all other
modules depending on it by requiring it publicly. In this way its
API is immediately usable.
Implying Readability
Lets turn to our UI module, which exposes a model that uses
types from the module javafx.base in its API. We can now
change the module descriptor so that it publicly requires that
module:
module com.infoq.monitor.ui {
// expose javafx.base to modules depending on this one
requirespublic javafx.base;
requires javafx.controls;
requires javafx.graphics;
exports com.infoq.monitor.ui.launch;
204
exports com.infoq.monitor.ui.show;
}
Our main module can now implicitly read javafx.base and no
longer has to explicitly depend on it, since its dependency
com.infoq.monitor.ui exports it:
module com.infoq.monitor {
requires com.infoq.monitor.stats;
requires com.infoq.monitor.db;
requires com.infoq.monitor.ui;
// we dont need javafx.base anymore to update the UI model
requires java.logging;
// no packages to export
}
While this changes why com.infoq.monitor reads com.infoq.
monitor.stats it does not change the fact that it does read it.
Consequently this changes neither our module graph nor the
commands required to compile and launch the application.
Beyond Module Boundaries
To quote once again the design overview:
In general, if one module exports a package containing a type
whose signature refers to a package in a second module then
the declaration of the first module should include a requires
public dependence upon the second. This will ensure that other
modules that depend upon the first module will automatically be
205
able to read the second module and, hence, access all the types
in that modules exported packages.
But how far should we take this? Look, for example, at
the java.sql module. It exposes the interface Driver, which
contains a public method getParentLogger() returning aLogger.
Because of that the module publicly requires java.logging. So
any module using Javas SQL features can also implicitly
access the logging API.
With that in mind, lets have a second look at our database
module:
module com.infoq.monitor.db {
requires java.logging;
requires java.sql;
exports com.infoq.monitor.db.write;
Technically the declaration of requiring java.logging is not
needed and might seem redundant. So should we drop it?
To answer this question we have to look at how
exactly com.infoq.monitor.db uses java.logging. Our module
might only need to read it so we are able to
call Driver.getParentLogger(), do something with
the logger (e.g. log a message) and be done with it. In this case
our codes interaction with java.logging happens in the
immediate vicinity of its interaction
with Driver from java.sql. Above we called this the boundary
between com.infoq.monitor.db and java.sql.
206
Alternatively we might be using logging
throughout com.infoq.monitor.db. Then, types
from java.logging appear in many places independent
of Driver and can no longer be considered to be limited to the
boundary of com.infoq.monitor.db and java.sql.
With Jigsaw being cutting edge, the community still has time to
discuss such topics and agree on recommended practices. My
take is that if a module is used on more than just the boundary
to another module it should be explicitly required. This
approach clarifies the systems structure and also future-proofs
the module declaration for refactorings. So as long as our
database module uses logging independently of the SQL
module I would keep it.
Aggregator Modules
Implied readability opens up the path to so-called aggregator
modules, which contain no code on their own but aggregate a
number of other APIs for easier consumption. This is already
being employed by the Jigsaw JDK, which models compact
profiles as modules that simply expose the very modules whose
packages are part of the profile.
In case of our microservice monitor we could envision a single
API module aggregating the modules for statistics, user
interface and data, so that the main module only has a single
dependency:
module com.infoq.monitor.api {
requirespublic com.infoq.monitor.stats;
requirespublic com.infoq.monitor.db;
requirespublic com.infoq.monitor.ui;
207
// implied readability is not transitive
// so we have to explicitly list `javafx.base`
requirespublic javafx.base
}
module com.infoq.monitor {
requires com.infoq.monitor.api;
requires java.logging;
// no packages to export
}
Services
So far we have covered dependencies that are determined and
declared at compile time. A looser coupling can be achieved if a
dependency takes the form of a service, where one or more
modules provide a functionality, abstracted behind a single
type, and others consume instances of that type. The module
system can be used by the consumer to discover providers. This
implements the service locator pattern, where the module
system itself acts as the locator.
A service is a set of interfaces and (usually abstract) classes that
provide a cohesive feature. All involved types must be
accessible from a single type (likely an interface) so that one
can be used to load the service.
A service provider module contains one or more
implementations of a service. For each it includes a provides X
with Y; clause in its module descriptor, where X is the fully
208
qualified name of the service interface and Y the fully qualified
name of the implementing class. Y needs to have a public,
parameterless constructor so that the module system can
instantiate it.
The service consumer module has to read the service module
and include a uses X; clause in its descriptor. At runtime it can
then call ServiceLoader.load(Class<X>) and get a loader for the
service interface. The loader is iterable and contains instances
of all classes that implement X and were provided by service
provider modules.
When services are used as described here, the coupling is not
only loosened at compile time (because the consumer does not
declare dependencies on the providers). It is also loose at
runtime because the module system will create no read edge
from consumer to provider.
Another interesting aspect is that the set of available providers
for a service is defined by the module path (i.e. usually at
launch time). Exactly those observable modules that provide
implementations of a service will be available at runtime via the
service loader. So a systems behavior can be influenced by
editing the content of its module path and restarting it.
In our example we have a module com.infoq.monitor.stats,
which we said contacts the services running in our network and
creates statistics. This sounds like a lot of work for a single
module, but we can split it up.
Watching individual services is a uniform task so we create an
API for that and put it into the new
module com.infoq.monitor.watch. The service interface is
called com.infoq.monitor.watch.Watcher.
209
We are now free to create one or more modules with
implementations for concrete microservices. Let's call these
modules com.infoq.monitor.watch.login, com.infoq.monitor.wa
tch.shipping and so forth. Their module descriptors are as
follows:
module com.infoq.monitor.watch.login {
// we need the module defining the service we are providing;
// we imply readability so this module is usable on its own
requirespublic com.infoq.monitor.watch;
provides com.infoq.monitor.watch.Watcher
with com.infoq.monitor.watch.login.LoginWatcher;
}
Note that they only provide implementations of Watcher but
export no packages.
With all the code that contacts microservices carved out
of com.infoq.monitor.stats, we now have to make sure that it
still has access to that functionality. Its new module-info.java:
module com.infoq.monitor.stats {
requires java.logging;
requires com.infoq.monitor.watch;
// we have to declare which service we are depending on
uses com.infoq.monitor.watch.Watcher;
exports com.infoq.monitor.stats.get;
210
}
Somewhere in its code we can now do this:
List<Watcher> watchers =new ArrayList<>();
ServiceLoader.load(Watcher.class).forEach(watchers::add);
That produces a list containing one instance of each Watcher
implementation that was provided by modules on the module
path. From here on out the code would be similar to before, i.e.
contacting the services and creating statistics.
Limiting our module graph to com.infoq.monitor.stats and
below (because everything else is unchanged) our new version
looks as follows:
211
Fig 7.4 Module Graph with use of Services
Note how the arrows all point towards the new module; a
typical example for the dependency inversion principle.
212
CHAPTER 8 - MACHINE LEARNING USING SPARK
Getting Started
Spark is the hottest tool in big data analytics field. It is used by
more and more companies for machine learning and analytics
.People build a lot of enterprise applications using Apache
spark There is a lot os third party support for Spark .It is a fast
and general engine to perform big data processing in a simple
and scalable way. Its an open source cluster computing
framework. Its an end to end analytics framework facilitating
loading the data , processing the data , performing analytics etc.
Running on a single desktop or a cluster Spark is fine for
interactive or stream processing and supports java very well.
Spark is supported by third party reporting solutions .It is
centered around data processing , data integration and
interactive analytics , delivering high performance for machine
learning and analytics , real time stream processing .
Typical Spark workflow involves Loading the data from a data
source (HDFC ,real time ,cloud source),Transforming the data
(filter clean join enhance),Storing processed data (Memory
HDFS ,NO sq) and performing Interactive analytics .Its
application in machine learning involves Predictive analytics
.Spark coding is very concise .Spark is a general purpose
compute engine and is going to run in an embedded mode in the
whole process .More information on the same is available under
Spark.apache .org.
First Spark Program
The following is the snippet for te first Spark Program .
import java.util.Arrays;
import java.util.List;
213
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class YourFirstSparkProgram {
public static void main(String[] args) {
//Setup configuration
String appName = "V2 Maestros";
String sparkMaster = "local[2]";
//String sparkMaster = "spark://169.254.86.212:7077";
JavaSparkContext spContext = null;
SparkConf conf = new SparkConf()
.setAppName(appName)
.setMaster(sparkMaster);
//Create Spark Context from configuration
spContext = new JavaSparkContext(conf);
//Read a file into an RDD
JavaRDD<String> tweetsRDD =
spContext.textFile("data/tweets.csv");
//Print first five lines
for ( String s : tweetsRDD.take(5)) {
System.out.println(s);
}
//Print count.
214
System.out.println("Total tweets in file : " +
tweetsRDD.count());
/*
//Convert to upper case
JavaRDD<String> ucRDD = tweetsRDD.map(
str -> str.toUpperCase());
//Print upper case lines
for ( String s : ucRDD.take(5)) {
System.out.println(s);
}*/
while(true) {
try {
Thread.sleep(10000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
The context starts the embedded spark instance .We load the
data from the csv file into Resilient Data sets and then convert
the data into uppercase using a map function .When the session
is active we can go and see the website localhost:4040, it will
tell us what is happening in the spark website actually (this can
be done running a Thread in an infinite while loop).Spark is
written in scala which is derived from java .
215
Fig 8.1 Spark localhost:4040
Spark eco system
It includes the components and capabilities which spark
provides .Engine is the spark core .it is build on RDDs
Resilient distributed data set. We can manage Spark with
apache yarn or apache Mesos or use the inbuild spark scheduler.
Spark can be used as part of the Hadoop cluster or an stand
alone cluster .Spark can work with files in a local data store of
HDFS datastore or a RDBMS datastore or no sql datastores.
Spark provides connectors to all these data sources .Libraries on
top of spark core provides a lot of capabilities on top of spark
core .Spark SQL provides a SQL like interface for various spark
capabilities .Spark SQL is a primary interface you will be
working with in spark .ML library provide algorithms which
scale across the cluster and provide functions from regression to
nave bayes (supervised and unsupervised learning)
We can use GraphX to analyze graphs and streaming librarries
to get data from real time streams and process them and predict
216
in a real time manner.From programming point of view, it has
interfaces to scala, python, R and Java .Following is the picture
of the Spark Ecosystem.
Fig 8.2 Spark Framework
RDD
The entire world of spark is build on RDD .RDD stands for
resilient distributed dataset Dataset is collection of data ; it is
distributed in a no of pieces called partitions .They are
distributed across clusters .If a spark cluster has 4 nodes then
the RDD can be moved to 4 cluster nodes/partitions .They can
be operated in all these cluster nodes in parallel .Spark is built
around RDD. We can create, transform, analyze and store RDD
in a spark program.
Dataset is a collection of elements or any type- Strings, lines,
rows, objects ,collection,etc. It can be partitioned and
distributed across the multiple nodes . They are immutable and
cant be changed.We can only create and destroy a RDD and
217
cant change them .We can perform some transformations and
create new RDDs .They can be cached and persisted
.Transformations act on RDDs to create new RDDs. Actions
analyze RDDs to provide result . (Result which is not a RDD
but a result likeavg, max, min)
Internals of Spark
Spark executes various tasks which we give it to it .There is a
master node and worker node .In the master node we run a
driver program and we create a spark context inside it .Context
is a connection objects and its a gate way to all the spark
functionality .Worker nodes are slave nodes and work based on
the instructions they get from the master.
Fig 8.3 Spark Architecture
Spark executor programs are slaves which are waiting
instructions from the master .The executor and executor
program are connected via Cluster Manager .The cluster
manager can be the spark scheduler or apache yarn
Suppose we want to create a RDD for 5 records ,then the
context tells the cluster manager and the cluster manager
divides the work among the executors spark will partition
218
first 2 records in partition 1 and 3-5 records into the second
partition . They will be moved and stored within the executors
.RDD thus created will be moved to the worker nodes .
In case we want to execute the transformation job, that job is
split into jobs on the executors through the cluster manager
.Each executor does that job on the locally stored RDD on the
worked nodes .The worker nodes work in parallel and
independently of each other .On worker nodes the executors run
the tasks locally .When the results are retrieved, the RDDs
locally are joint and given back to the driver program .RDDs
are partitioned and stored locally on the worker nodes.
The function of the driver program and Spark Context is
summarized below :
219
We can run spark in a batch mode. (job basis) or Interactive
mode (similar to an interactive shell)
We can also run this in a cluster or streaming mode .The
different modes of Spark are summarized below :
We can run it as a single JVM or as a managed cluster as far as
scalability is concerned.
Spark project work flow
An analytics project is structured .We have some Raw data to
get answers to our question of Predictive analytics .(like People
220
visited which page , which product they bought, which page
they spent most time on.etc)So we acquire data ,process
data(cleanse ,filter, augment) based on the questions and
perform exploratory data analysis .We do some Machine
learning work like correlation analysis , training and test split ,
model building , prediction ,etc and based no this we take
action.Feedback will be provided by the actions and then we
correct the workflow to make the process better .RDD can be
created in a no of ways and operate in a Lazy Evaluation mode.
(It means Spark will not load or transfer the data unless an
action is performed)
They can be created from Text files, JSON Files or apply
parallelize on local collections. (Native to that program
language) local data structure Or from HDFS sequence files.
From RDBMS we can load the data into the local collection and
then create the RDD
If the data is very large create HDFS files out of spark using
apache Sqoop and then create RDD from them .Spark
understands the Hadoop partitions very well .storing the RDD
directly to text files , json ,sequence files or collections .Spark
provides simple functions to persist RDD to a variety of data
sinks Text Files ,JSON ,Sequence Files and collections
For optimization use language specific libraries for persistence
post the data is moved to data structures than using spark
utilities .Spark specializes in processing data rather than loading
and storing data.
SaveastEXTFILE () saves the RDD as a Text file
221
Spark OperationsSnippet
1. Get the spark context .This is the main entry point in the
spark activity
publicstaticJavaSparkContextgetContext() {
if ( spContext == null ) {
getConnection();
}
returnspContext;
}
2. Give the Spark master node (2 partitions by default ) and
appname and Create the conf object Post this spark gets
started
if ( spContext == null) {
//Setup Spark configuration
SparkConfconf = newSparkConf()
.setAppName(appName)
.setMaster(sparkMaster);
System.setProperty("hadoop.home.dir",
"c:\\spark\\winutils\\");
//Create Spark Context from configuration
spContext = newJavaSparkContext(conf);
}
222
3. We create spark session for spark sql
sparkSession = SparkSession
.builder()
.appName(appName)
.master(sparkMaster)
.config("spark.sql.warehouse.dir", tempDir)
.getOrCreate();
4. Creating a RDD in Spark from a Java List
publicstaticJavaRDD<Integer>getCollData() {
JavaSparkContextspContext =
SparkConnection.getContext();
List<Integer>data = Arrays.asList(3,6,3,4,8);
JavaRDD<Integer>collData =
spContext.parallelize(data);
collData.cache();
returncollData;
Spark context parallelize and will convert the list into JavaRDD
(it makes number of partitions based on the no of cores)
There are 2 imports
importorg.apache.spark.api.java.JavaRDD;
importorg.apache.spark.api.java.JavaSparkContext;
223
5. Till the time we perform a count on the RDD , RDD will
not be created
//1
JavaRDD<Integer>collData = DataResource.getCollData();
//Lazy-evaluation. creation happens only during action.
//2
System.out.println("Total records = " + collData.count());
The RDD creation will happen only at step no 2 .
RDD can be created from a file as well
JavaRDD<String>autoAllData
= spContext.textFile("data/auto-data.csv");
Every line in the file is stored as a String object and all the
string objects are stored in the RDD.
Take() function is used to take and print a specific number of
lines of the RDD
for ( String s : stringRDD.take(count)) {
System.out.println(s);
6. The below code splits the original csv and store them as
partitions saveAsTextFile() will save each partition into
individual files
JavaRDD<String>autoAllData
= spContext.textFile("data/auto-data.csv");
224
autoAllData.saveAsTextFile("data/auto-data-modified.csv");
7. RDDs can be converted back to Java lists using collect ()
Individual partitions are collected and the master node
will return the same
8. We can apply a transformation on RDD , we will end up
creating a new RD since Lazy evaluation is applied
9. Until we apply an action on the resultant RDD .can be
distributed across multiple nodes based on the partitions they
act upon .spark acts on each partition independently and
each have its memory and CPU and returns the result
collated back The resultant RDD is then partitioned in the
same way as the input RDD
Map function
The map function can be described as under :
NewRDD = RDD. Map (function)
RDD works similar to a map reduce map.It acts upon each
element and perform some operation and returns new Object
.Result RDD may have the same number of elements as original
RDD
JavaRDD<String>autoAllData
= spContext.textFile("data/auto-data.csv");
JavaRDD<String>tsvData = autoAllData
.map(str ->str.replace(",", "\t"));
225
It will return the tsvdData RDD where in each comma is
replaced with a tab character .Result can be of different type
and the data can be different -only thing is the number of
records will be the same .We can use inline function or our own
function.We can use for data standardization , data type
conversion ,element level computations like computing tax ,and
adding new attributes , data checking ,
We can write the function into a different class altogether
.implementing Function interface which is part of Spark
FlatMap
A flatMap does everything a map does but can return more
number of elements than the original map. (for every source
records). Example could be breaking up scenarios in the
original map and creating new maps like Extract child elements
from a nested json string .
newRdd = rdd.flatMap(function)
It takes every object and return multiple objects ot takes every
line in the RDD and break them into works and return the
words
JavaRDD<String>words
=
toyotaData[TestRDD].flatMap(newFlatMapFunction<String,
String>() {
public Iterator<String> call(String s) {
returnArrays.asList(s.split(",")).iterator();
}
});
226
The first parameter is input datatype and second is output data
type We need to implement call () and return value is iterator
of string
Filter
Its a very common operation .Syntax usage is as below .
newRdd= Rdd.filter (function)
Its an inline or custom function which returns a Boolean true or
false .The ResultRDD will be smaller than the original RDD.A
function can be passed as a condition to perform a complex
filtering .in case the inline function returns true , the filter
operation will pass it to the new RDD else it will not .
JavaRDD<String>autoAllData
= spContext.textFile("data/auto-data.csv");
// For filtering the header
JavaRDD<String>autoData
= autoAllData.filter(s -> !s.equals(header));
// filtering for lines that will have text as toyota
JavaRDD<String>toyotaData
= autoData.filter(str ->str.contains("toyota"));
// printing distinct data
JavaSparkContextspContext = SparkConnection.getContext();
List<Integer>data = Arrays.asList(3,6,3,4,8);
227
JavaRDD<Integer>collData = spContext.parallelize(data);
System.out.println("Spark Distinct Example : "
+
collData.distinct().collect());
Set operations
Union returns a new dataset that contains the union of the
elements in the source dataset and the arguments .Syntax is as
follows.
UnionRDD= firstRDD.union(secondRDD)
Common records will not be duplicated .We can Merge 2
datasets from 2 months as an example .
Its a row level merging not a column level merging .
Intersection returns a new RDD that contains the intersection
of elements in the source dataset and the argument .Syntax is as
follows.
IntersectionRDD = firstRDD.intersect(secondRDD)
JavaRDD<String>words1 = spContext.parallelize(
Arrays.asList("hello", "war", "peace", "world"));
JavaRDD<String>words2 = spContext.parallelize(
Arrays.asList("war", "peace", "universe"));
System.out.println("Example for Set operations : Union:");
ExerciseUtils.printStringRDD(words1.union(words2), 10);
228
System.out.println("Example for Set operations : Intersection");
ExerciseUtils.printStringRDD(words1.intersection(words2),10);
Actions
Actions acts on the entire RDD and produces a consolidated
result (not a RDD)
Spark does not do nay evaluation until it hits an action (like
loading or transformation)
Most common actions :
Collect returns all the elements in the RDD as an array .it
collects all the data back from individual nodes to the local
programming array where the driver program is running
Count returns the no of elements in the RDD
Take(n) returns the first n elements in the RDD
First() returns the first element of the RDD
Reduce operation does some operation across all elements in
the RDD
Eg .sum .The operation is a function that takes 2 input values .
The reduce function works recursively as below:
229
The important thing is to take care of data type return data
type. Following is the snippet of the reduce function .
JavaSparkContext spContext = SparkConnection.getContext();
List<Integer> data = Arrays.asList(3,6,3,4,8);
JavaRDD<Integer> collData = spContext.parallelize(data);
int collCount = collData.reduce((x, y) -> x + y);
System.out.println("Compute sum using reduce " + collCount);
Pair RDD
Special kind of RDD s that can store Key Value pairs .They
can be created through map operations on regular RDDs also
.All transformations for regular RDD applies to map RDD.
Popular functions for pair RDD are map values and
flatMapValues .For e.g. mapValues transform each value
without changing the key .Thus we can calculate the tax for
sales amount under the same key .flatMapValues can generate
multiple values with the same key.for eg a value to a key is a
list then flat map will creates multiple values with the same key.
There are certain actions defined for pair RDDs
countByKey = produces a count by each key in the RDD
groupByKey=perform aggregation by key
reduceByKey=perform reduce by key
aggregateByKey =perform aggregate by key
230
Advanced Spark
Spark makes a copy of the code in every node of the Spark
cluster .These copies run independently of each other .Thus a
global variable is also local to a particular node .Thus there will
be copies of the local variable on each node of the cluster .We
need to use Broadcast variable which is a read only variable
shared across all the nodes .Spark optimizes distribution and
storage for better performance .Accumulators are also shared
across all the nodes but they can be updated and Spark will take
care of updating it centrally .These are used as global variables
for counting .
Partioning
A RDD is broken down and distributed across all the nodes .By
default the number of partitions is equal to the no of cores on
the entire cluster .For very large clusters ,we need to configure
it .using spark .default.parallelism parameter .It can be done
during the RDD creation .
Persistance
By default spark creates a RDD only when it is required .It will
load and compute the RDD chain each time an operation is
performed .When the operation completes , it will drop all the
data since million of records cant be kept in memory for long.
We can force Spark to persist RDD using persist () or cache ()
so that we dont have to re-cache the RDD.
Lets say we calculate no of Sedan Vehicles in a RDD
containing vehicles .We declare an accumulator to store the
counts and a broadcast variable to store the kind of String to
looks for in this case its sedan
231
JavaSparkContext spContext = SparkConnection.getContext();
LongAccumulator sedanCount =
spContext.sc().longAccumulator();
Broadcast<String> sedanText = spContext.broadcast("sedan");
JavaRDD<String> autoOut
= autoData.map(new Function<String, String>() {
public String call(String x) {
if
(x.contains(sedanText.value())) {
sedanCount.add(1);
}
return x;
}
});
// to triggrer the map function
autoOut.count();
When we invoke the function getNumPartitions() it prints out
the default no of partitions which in this case is 2 .But we can
explicitly mention the number of partitions using parallelize
operation .The localhost console on port 4040 is also pasted
below for reference .
autoData.cache();
System.out.println("No. of partitions in
autoData = " +
autoData.getNumPartitions());
232
JavaRDD<String> wordsList =
spContext.parallelize(
Arrays.asList("hello", "to",
"Spark", "world"), 4);
System.out.println("No. of partitions in
wordsList = " +
wordsList.getNumPartitions());
Fig 8.4 Spark localhost:4040describing parallel connection
RDB
Spark SQL
Spark SQL is build on top of Spark and RDD to perform on
Data and operations .It makes it easy for SQL programmers to
migrate to big data world .It works with structured data which
has schema .It supports JDBC .Similar to a context , Spark
session is used for all the operations done via Spark SQL.Data
frames are there for Spark Session like RDD s are there for
Spark Core .A dataframe is a distributed collection of data
organized in rows and columns .It has data types , column
names and schema .Data frames are very similar to DB table
.We can create a data frame from JSON,RDD,CSV and DB
233
tables. We can perform operations like filter, joins, groupby,
aggregate and the regular map and reduce operations on
dataframes. Below we will load a Dataframe and perform
operations using Spark SQL.we can convert a dataframe into a
RDD anytime .We can also construct a dataframe from the Java
Objects,RDD,CSV files .
private static SparkSession sparkSession = null;
private static String tempDir = "file:///c:/temp/spark-
warehouse";
sparkSession = SparkSession .builder() .appName(appName)
.master(sparkMaster).config("spark.sql.warehouse.dir",
tempDir) .getOrCreate();
//We will get the dataframe using json.
Dataset<Row> empDf =
sparkSession.read().json("data/TradeData.json");
// to see the dataframe content
empDf.show();
// to display the schema
empDf.printSchema();
// select operation
empDf.select(col("name"),col("salary")).show();
// filter method
empDf.filter(col("age").equalTo(40)).show();
// grouping of data
empDf.groupBy(col("gender")).count().show();
// join condition
234
Dataset<Row> joinDf = empDf.join(deptDf,
col("deptid").equalTo(col("id")));
The console output is as shown below :
235
Spark Streaming
Till now we have been dealing with data at rest .There
are many usecases which need real time analytics .Spark
streaming takes the data as it is coming in batches and perform
predictive analytics and apply other machine learning
techniques over it .We use this in credit card fraud detection,
Spam filtering ,network intrusion detection , real time social
media analytics , click stream analytics and recommendations
,stock market analytics , etc .Spark support flat files ,TCP/IP
port ,Apache Flume , Apache Kafka , Apache Kinesis, social
media , etc as part of the data sources .Architecture wise , the
streaming context lies inside the Spark Context .Streaming
creates a DStream or a discretized stream on which the
processing occurs .We can attach a number of map and reduce
operations on these DStreams.A micro batch window is set up
for a DStream.Data coming from the DStream gets accumulated
as a batch and process it at one time .Each micro Batch is a
RDD.Windowing functions are also available on DStrems. One
of the worker nodes is assigned the job of listening to the source
.For that Spark is going to create a Long Task and a receiver
which keeps listening to the input source .The receivers then
propagates the data to all the worker nodes .The normal tasks in
the worker node acts upon the data and produces results like
they act in a regular RDD.
236
Fig 8.6 Spark Streaming
Machine Learning with Spark
We will learn machine learning algorithms applied to
predictive analytics. Analytics means making inferences and
taking actions on data .The below table shows the types of
analytics possible with machine learning.
In exploratory analytics, we want to understand the
predictors and the targets in the dataset and what are the relation
or correlation between these variables .It is used to uncover
pattern and trends .It is also used to find key variables and
eliminate unwanted variables and detect outliners .We come to
know in case the previous data ingestion process has any
mistakes .It is also used to test assumptions and hypothesis .The
tools used for EDA are correlation matrices, Boxplots ,Scatter
Plots ,Principal component analysis and histograms .The data is
very central to the whole idea and contains a lot of attributes
237
.These attributes show a lot of relationships between the entities
.The process of learning for understanding the relationship
between the data and using a computer to do the needful
comprises of machine learning .Using machine learning we
build a model which could be a decision tree ,etc .These model
can be used to group similar data or predict the outcome .Since
machines only understand numbers ,text data needs to be
converted to equivalent numerical representations for ML
algorithms to work .
Supervised and unsupervised learning
The difference between the two is in supervised
learning we have a target variable and we predict the target
variable .In unsupervised learning we group the data based on
similarity and there is no target variable .Similarity is derived
based on distances , presence or absence of values .The types of
unsupervised learning are clustering ,association rules mining
,and collaborative filtering .In supervised learning we predict an
unknown variable (outcome ) based on the known attributes
(predictors ) for an entity. Models here are build using training
data and its then used to predict the future outcomes .The types
are basically regression (continuous outcome values) and
classification (outcome classes).The supervised learning is
depicted below :
Fig 8.7 Supervised Learning
238
As a best practice, we go with a 70-30 split for the
training and testing data.When we want to compare the results
for training data and testing data, we prepare what is known as a
confusion matrix wherein we plot the predictions against the
actuals of the test data .The confusion matrix has the prediction
types which are depicted below.
Fig 8.8 Confusion Matrix
The more accurate the algorithm is , the less will be
false positive and false negatives it will generate .False
negatives may not be acceptable in the medical field and
similarly false positives will not be acceptable in the judicial
field.There are 2 kinds of prediction errors Bias and variance .
Bias happens when model skews itself to certain
aspects of the predictors, while ignoring others .Variance refers
to the stability of the model .The following is demonstrated
with the help of a diagram .
239
Fig 8.9 Variation and Bias
Types of errors :
1. In-Sample Error-it is the prediction error when the model is
used to predict on the training set it is built upon.
2. Out-of-the-sample error is the prediction error when the
model us used to predict on a new data set .
3. Over fitting refers to a situation where the model has very
low in-sample errors but very high out-of-the-sample errors.
Machine Learnin types with Spark
Spark makes ML practical and easy .Spark has 2 libraries
spark.mllib and Spark.ml .Spark.ml is build using data frames
and pipelines.We need to be aware about 2 Data types Local
vector and Labeled Point.
Local vector is a vector of double vector and it has non zero
values most of the time .Thus it is also called as dense vector
.Sparse vector on the other hand can have zeros in them
.Labeled point is a Data point in ml.It which contains a label
(the target variable )and a list of features.(predictors).Pipelines
is the pipe build around transformations and actions that need to
be performed to create a model .we can create a pipeline object
specifying the list of activities using DataFrames,Transformers
240
,Estimators,and parameters .Spark better parallelize and utilize
resources using pipelines .
-Linear Regression
It is a very powerful method to analyse the relationship between
multiple variables .Here we estimate the value of dependent
variables from the value of independent variables using a
relationship equation.
The relationship equation is the model in linear regression .It is
used when the dependent and independent variables are
continuous and have some correlation .Here we are trying to
predict a number than some sort of classification .Goodness of
fit is important .A linear equation is one where we can expalin
the relation between x and y using an equation y = ax + b .Here
determining the values of a ,b we can predict y using the value
for x .a is the slope here .and b is the y intercept of the line .The
model building determines a and b .The model makes use of the
fitting the line concept which means that given the scatter plot
of y versus x , fit a straight line through the points so that the
sum of the vertical distances between the points and the line is
minimized.
Fig 8.10 Linear Regression
241
The best line is where the residuals is the least .R squared
measures .Figure how close the data is to the fitted line .R-
squared is between 0 -1 .The higher the value the better the fit
.Higher correlation leads to a better fit .
-Multiple Regression
When there is more than one independent variables or
predictors that is used to predict a dependent variable .The
equation will be like y = b + px1 + q x2 + rxn
This will be multi dimensional plot .Same process of prediction
holds good here as well.
Fig 8.11 Multiple Regression
Regression is a very important form of supervised learning
technique where predictors are provided as inputs and the
model is the linear equations coefficients ,intercept and R-
Squared .It is very old, low in cost and popular system for
building models with continuous variables .But this cant be
used for non-linear /fuzzy relationships .It is also very sensitive
to outliners .
Lets look onto a example :
/*---------------------------------------------------------
-----------------
Load Data
242
---------------------------------------------------------
-----------------*/
Dataset<Row> autoDf = spSession.read()
.option("header","true")
.csv("data/TestCsvWithAutomobilesData.csv");
autoDf.show(5);
autoDf.printSchema();
/*-------------------------------------------------------
-------------------
Cleanse Data
---------------------------------------------------------
-----------------*/
//Convert all data types as double; Change
missing values to standard ones.
//Create the schema for the data to be loaded
into Dataset.
StructType autoSchema = DataTypes
.createStructType(new
StructField[] {
DataTypes.createStructField("MPG",
DataTypes.DoubleType, false),
243
DataTypes.createStructField("CYLINDERS",
DataTypes.DoubleType, false),
DataTypes.createStructField("DISPLACEMENT",
DataTypes.DoubleType, false),
DataTypes.createStructField("HP",
DataTypes.DoubleType, false),
DataTypes.createStructField("WEIGHT",
DataTypes.DoubleType, false),
DataTypes.createStructField("ACCELERATION",
DataTypes.DoubleType, false),
DataTypes.createStructField("MODELYEAR",
DataTypes.DoubleType, false),
DataTypes.createStructField("NAME",
DataTypes.StringType, false)
});
// Removing of special characters ?
Broadcast<Double> avgHP = spContext.broadcast(80.0);
//Change data frame back to RDD
JavaRDD<Row> rdd1 = autoDf.toJavaRDD().repartition(2);
//Function to map.
JavaRDD<Row> rdd2 = rdd1.map( new
Function<Row, Row>() {
244
@Override
public Row call(Row iRow) throws Exception {
double hp = (iRow.getString(3).equals("?") ?
avgHP.value()
: Double.valueOf(iRow.getString(3)));
Row retRow =
RowFactory.create( Double.valueOf(iRow.getString(0)),
Double.valueOf(iRow.getString(1)),
Double.valueOf(iRow.getString(2)),
Double.valueOf(hp),
Double.valueOf(iRow.getString(4)),
Double.valueOf(iRow.getString(5)),
Double.valueOf(iRow.getString(6)),
iRow.getString(7)
);
return retRow;
}
});
//Create Data Frame back.
245
Dataset<Row> autoCleansedDf =
spSession.createDataFrame(rdd2, autoSchema);
System.out.println("Transformed Data :");
autoCleansedDf.show(5);
/*----------------------------------------------------------
----------------
Analyze Data
---------------------------------------------------------
-----------------*/
//Perform correlation analysis
for ( StructField field : autoSchema.fields() ) {
if ( !
field.dataType().equals(DataTypes.StringType)) {
System.out.println(
"Correlation between MPG and " + field.name()
+ " = " +
autoCleansedDf.stat().corr("MPG", field.name()) );
}
}
/*-------------------------------------------------------
-------------------
Prepare for Machine Learning.
---------------------------------------------------------
-----------------*/
//Convert data to labeled Point structure
246
JavaRDD<Row> rdd3 =
autoCleansedDf.toJavaRDD().repartition(2);
JavaRDD<LabeledPoint> rdd4 = rdd3.map(
new Function<Row, LabeledPoint>() {
@Override
public LabeledPoint call(Row iRow)
throws Exception {
LabeledPoint lp = new
LabeledPoint(iRow.getDouble(0) ,
Vectors.dense(iRow.getDouble(2),
iRow.getDouble(4),
iRow.getDouble(5)));
return lp;
}
});
Dataset<Row> autoLp = spSession.createDataFrame(rdd4,
LabeledPoint.class);
autoLp.show(5);
// Split the data into training and test sets (10% held out for
testing).
Dataset<Row>[] splits =
autoLp.randomSplit(new double[]{0.9, 0.1});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
247
/*-------------------------------------------------------
-------------------
Perform machine learning.
---------------------------------------------------------
-----------------*/
//Create the object
LinearRegression lr = new LinearRegression();
//Create the model
LinearRegressionModel lrModel =
lr.fit(trainingData);
//Print out coefficients and intercept for LR
System.out.println("Coefficients: "
+ lrModel.coefficients() + "
Intercept: " + lrModel.intercept());
//Predict on test data
Dataset<Row> predictions =
lrModel.transform(testData);
//View results
predictions.select("label", "prediction",
"features").show(5);
//Compute R2 for the model on test data.
RegressionEvaluator evaluator = new
RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
248
.setMetricName("r2");
double r2 = evaluator.evaluate(predictions);
System.out.println("R2 on test data = " + r2);
// Keep the program running so we can
checkout things.
ExerciseUtils.hold();
}
-Decision Trees
It is a easy and fast ML technique where we use the predictor
variables to build a decision tree.Trees start with a root node
that start the decision making process .Branch nodes refines the
decision process .Leaf nodes provide the decisions .Training
data is used to build a decision tree to predict the target.It is
mostly used for classification .The tree is the model here that is
used to predict the target .Following is an example of decision
tree .
Fig 8.12 Decision Trees
249
The challenge as to what to use in the root node and the number
of levels of the decision tree is taken care by the ML
Algorithms .Predictors with high selectivity gives very fast
results .It also works very well with missing data .It is sensitive
to local variations .The disadvantages are its accuracy is limited
,bias builds up pretty quickly and its not good with large
number of variables .Typically it is used in credit approvals and
preliminary categorization .
The following is a code piece to demonstrate the decision tree
algorithm
/*---------------------------------------------------------
-----------------
Load Data
---------------------------------------------------------
-----------------*/
Dataset<Row> irisDf = spSession.read()
.option("header","true")
.csv("data/iris.csv");
irisDf.show(5);
irisDf.printSchema();
/*-------------------------------------------------------
-------------------
Cleanse Data
---------------------------------------------------------
-----------------*/
//Convert all data types as double; Change
missing values to standard ones.
250
//Create the schema for the data to be loaded
into Dataset.
StructType irisSchema = DataTypes
.createStructType(new StructField[] {
DataTypes.createStructField("SEPAL_LENGTH",
DataTypes.DoubleType, false),
DataTypes.createStructField("SEPAL_WIDTH",
DataTypes.DoubleType, false),
DataTypes.createStructField("PETAL_LENGTH",
DataTypes.DoubleType, false),
DataTypes.createStructField("PETAL_WIDTH",
DataTypes.DoubleType, false),
DataTypes.createStructField("SPECIES",
DataTypes.StringType, false)
});
//Change data frame back to RDD
JavaRDD<Row> rdd1 =
irisDf.toJavaRDD().repartition(2);
//Function to map.
JavaRDD<Row> rdd2 = rdd1.map( new
Function<Row, Row>() {
@Override
public Row call(Row iRow) throws Exception {
251
Row retRow = RowFactory.create(
Double.valueOf(iRow.getString(0)),
Double.valueOf(iRow.getString(1)),
Double.valueOf(iRow.getString(2)),
Double.valueOf(iRow.getString(3)),
iRow.getString(4)
);
return retRow;
}
});
//Create Data Frame back.
Dataset<Row> irisCleansedDf =
spSession.createDataFrame(rdd2, irisSchema);
System.out.println("Transformed Data :");
irisCleansedDf.show(5);
/*-------------------------------------------------------
-------------------
Analyze Data
---------------------------------------------------------
-----------------*/
//Add an index using string indexer.
StringIndexer indexer = new StringIndexer()
252
.setInputCol("SPECIES")
.setOutputCol("IND_SPECIES");
StringIndexerModel siModel =
indexer.fit(irisCleansedDf);
Dataset<Row> indexedIris =
siModel.transform(irisCleansedDf);
indexedIris.groupBy(col("SPECIES"),col("IND_SPECI
ES")).count().show();
//Perform correlation analysis
for ( StructField field : irisSchema.fields() ) {
if ( !
field.dataType().equals(DataTypes.StringType)) {
System.out.println(
"Correlation between IND_SPECIES and " + field.name()
+ " = " +
indexedIris.stat().corr("IND_SPECIES", field.name()) );
}
}
/*-------------------------------------------------------
-------------------
Prepare for Machine Learning.
---------------------------------------------------------
-----------------*/
//Convert data to labeled Point structure
253
JavaRDD<Row> rdd3 =
indexedIris.toJavaRDD().repartition(2);
JavaRDD<LabeledPoint> rdd4 = rdd3.map(
new Function<Row, LabeledPoint>() {
@Override
public LabeledPoint call(Row iRow)
throws Exception {
LabeledPoint lp = new
LabeledPoint(iRow.getDouble(5) ,
Vectors.dense(iRow.getDouble(0),
iRow.getDouble(1),
iRow.getDouble(2),
iRow.getDouble(3)));
return lp;
}
});
Dataset<Row> irisLp =
spSession.createDataFrame(rdd4, LabeledPoint.class);
irisLp.show(5);
// Split the data into training and test sets (30%
held out for testing).
254
Dataset<Row>[] splits =
irisLp.randomSplit(new double[]{0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
/*-------------------------------------------------------
-------------------
Perform machine learning.
---------------------------------------------------------
-----------------*/
//Create the object
// Train a DecisionTree model.
DecisionTreeClassifier dt = new
DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features");
// Convert indexed labels back to original
labels.
IndexToString labelConverter = new
IndexToString()
.setInputCol("label")
.setOutputCol("labelStr")
.setLabels(siModel.labels());
IndexToString predConverter = new
IndexToString()
.setInputCol("prediction")
255
.setOutputCol("predictionStr")
.setLabels(siModel.labels());
DecisionTreeClassificationModel dtModel =
dt.fit(trainingData);
//Predict on test data
Dataset<Row> rawPredictions =
dtModel.transform(testData);
Dataset<Row> predictions =
predConverter.transform(
labelConverter.transform(rawPredictions));
//View results
System.out.println("Result sample :");
predictions.select("labelStr", "predictionStr",
"features").show(5);
//View confusion matrix
System.out.println("Confusion Matrix :");
predictions.groupBy(col("labelStr"),
col("predictionStr")).count().show();
//Accuracy computation
MulticlassClassificationEvaluator evaluator =
new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy");
256
double accuracy =
evaluator.evaluate(predictions);
System.out.println("Accuracy
= " + Math.round( accuracy * 100) + " %" );
// Keep the program running so we can
checkout things.
ExerciseUtils.hold();
}
-Dimensionality Reduction
The number of predictor variables are called as dimensions .A
lot of dimensions have memory requirements ,CPU
requirements ,and the time taken for machine learning
algorithm is huge .Correlation between the predictors might be
high which can influence the algorithm. There is also a chance
of over fitting .Manual selection of predictors might be risky
since there could be unknown correlations .We can take a call
based on correlation on the predictor variables and the target
variables or using decision tress .A scientific method is
principal component analysis .It is used to reduce the number of
predictors .It is based on Eigen values and Eigen vectors .Given
a set of M predictors ,PCA transforms this to a set of N
Predictors such that N < M. The new predictors are called as
derived predictors in the order of correlation .They overall can
explain a major portion of the variation in the target variable
.The new predictors retain similar level of correlation and
predictability .
-Random Forest
Random Forest is one of the most popular and accurate
algorithms .It is an ensemble method based on decision tress .In
257
ensemble decision making ,we build multiple models - a
decision tree model Each tree is used to predict an individual
result .A vote is taken on all the results to find the best answer
.Its in a way a collection of decision tress .Lets say the dataset
contains m samples and n predictors .We built x trees with a
different subset of data .For each tree ,a subset of m rows and n
columns are chosen randomly. For prediction ,new data is
passed to each of these x trees and x possible results are
obtained .The most found result is the aggregate prediction .Its a
highly accurate method and efficient with large number of
predictors .it is fully parallelizable .It is very good with missing
data .Disadvantage is its a time consuming and resource
consuming process. For categorical variables ,bias might exist if
levels are disproportionate. It is used in scientific research and
medical diagnosis. The following program snippet demonstrates
the PCA and random forest in action .
/*--------------------------------------------------------------------------
Load Data
---------------------------------------------------------
-----------------*/
Dataset<Row> bankDf = spSession.read()
.option("header","true")
.option("sep", ";")
.csv("data/bank.csv");
bankDf.show(5);
bankDf.printSchema();
258
/*-------------------------------------------------------
-------------------
Cleanse Data
---------------------------------------------------------
-----------------*/
//Convert all data types as double;
//Use Indicator variables
//Create the schema for the data to be loaded
into Dataset.
StructType bankSchema = DataTypes
.createStructType(new StructField[] {
DataTypes.createStructField("OUTCOME",
DataTypes.DoubleType, false),
DataTypes.createStructField("AGE",
DataTypes.DoubleType, false),
DataTypes.createStructField("SINGLE",
DataTypes.DoubleType, false),
DataTypes.createStructField("MARRIED",
DataTypes.DoubleType, false),
DataTypes.createStructField("DIVORCED",
DataTypes.DoubleType, false),
259
DataTypes.createStructField("PRIMARY",
DataTypes.DoubleType, false),
DataTypes.createStructField("SECONDARY",
DataTypes.DoubleType, false),
DataTypes.createStructField("TERTIARY",
DataTypes.DoubleType, false),
DataTypes.createStructField("DEFAULT",
DataTypes.DoubleType, false),
DataTypes.createStructField("BALANCE",
DataTypes.DoubleType, false),
DataTypes.createStructField("LOAN",
DataTypes.DoubleType, false)
});
//Change data frame back to RDD
JavaRDD<Row> rdd1 = bankDf.toJavaRDD().repartition(2);
//Function to map.
JavaRDD<Row> rdd2 = rdd1.map( new
Function<Row, Row>() {
@Override
public Row call(Row iRow) throws Exception {
260
//Convert age to float
double age =
Double.valueOf(iRow.getString(0));
//Convert outcome to float
double outcome =
(iRow.getString(16).equals("yes") ? 1.0: 0.0 );
//Create indicator variables for marital status
double single =
(iRow.getString(2).equals("single") ? 1.0 : 0.0);
double married =
(iRow.getString(2).equals("married") ? 1.0 : 0.0);
double divorced =
(iRow.getString(2).equals("divorced") ? 1.0 : 0.0);
//Create indicator variables for education
double primary =
(iRow.getString(3).equals("primary") ? 1.0 : 0.0);
double secondary =
(iRow.getString(3).equals("secondary") ? 1.0 : 0.0);
double tertiary =
(iRow.getString(3).equals("tertiary") ? 1.0 : 0.0);
//Convert default to float
double dflt =
(iRow.getString(4).equals("yes") ? 1.0 : 0.0);
261
//Convert balance to float
double balance =
Double.valueOf(iRow.getString(5));
//Convert loan to float
double loan =
(iRow.getString(7).equals("yes") ? 1.0 : 0.0);
Row retRow =
RowFactory.create( outcome, age, single, married, divorced,
primary,secondary, tertiary, dflt, balance, loan);
return retRow;
});
//Create Data Frame back.
Dataset<Row> bankCleansedDf =
spSession.createDataFrame(rdd2, bankSchema);
System.out.println("Transformed Data :");
bankCleansedDf.show(5);
/*-------------------------------------------------------
-------------------
Analyze Data
---------------------------------------------------------
-----------------*/
262
//Perform correlation analysis
for ( StructField field : bankSchema.fields() ) {
if ( !
field.dataType().equals(DataTypes.StringType)) {
System.out.println(
"Correlation between OUTCOME and " + field.name()
+ " = " +
bankCleansedDf.stat().corr("OUTCOME", field.name()) );
/*-------------------------------------------------------
-------------------
Prepare for Machine Learning.
---------------------------------------------------------
-----------------*/
//Convert data to labeled Point structure
JavaRDD<Row> rdd3 =
bankCleansedDf.toJavaRDD().repartition(2);
JavaRDD<LabeledPoint> rdd4 = rdd3.map(
new Function<Row, LabeledPoint>() {
@Override
public LabeledPoint call(Row iRow)
throws Exception {
263
LabeledPoint lp = new
LabeledPoint(iRow.getDouble(0) ,
Vectors.dense(iRow.getDouble(1),
iRow.getDouble(2),
iRow.getDouble(3),
iRow.getDouble(4),
iRow.getDouble(5),
iRow.getDouble(6),
iRow.getDouble(7),
iRow.getDouble(8),
iRow.getDouble(9),
iRow.getDouble(10)));
return lp;
}
});
264
Dataset<Row> bankLp =
spSession.createDataFrame(rdd4, LabeledPoint.class);
System.out.println("Transformed Label and
Features :");
bankLp.show(5);
//Add an index using string indexer.
StringIndexer indexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indLabel");
StringIndexerModel siModel = indexer.fit(bankLp);
Dataset<Row> indexedBankLp =
siModel.transform(bankLp);
System.out.println("Indexed Bank LP :");
indexedBankLp.show(5);
//Perform PCA
PCA pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(3);
PCAModel pcaModel =
pca.fit(indexedBankLp);
265
Dataset<Row> bankPCA =
pcaModel.transform(indexedBankLp);
System.out.println("PCA'ed Indexed Bank LP
:");
bankPCA.show(5);
// Split the data into training and test sets (30%
held out for testing).
Dataset<Row>[] splits =
bankPCA.randomSplit(new double[]{0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
/*-------------------------------------------------------
-------------------
Perform machine learning.
---------------------------------------------------------
-----------------*/
//Create the object
// Train a DecisionTree model.
RandomForestClassifier rf = new
RandomForestClassifier()
.setLabelCol("indLabel")
.setFeaturesCol("pcaFeatures");
266
// Convert indexed labels back to original
labels.
IndexToString labelConverter = new
IndexToString()
.setInputCol("indLabel")
.setOutputCol("labelStr")
.setLabels(siModel.labels());
IndexToString predConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictionStr")
.setLabels(siModel.labels());
RandomForestClassificationModel rfModel =
rf.fit(trainingData);
//Predict on test data
Dataset<Row> rawPredictions =
rfModel.transform(testData);
Dataset<Row> predictions =
predConverter.transform(
labelConverter.transform(rawPredictions));
//View results
267
System.out.println("Result sample :");
predictions.select("labelStr", "predictionStr",
"features").show(5);
//View confusion matrix
System.out.println("Confusion Matrix :");
predictions.groupBy(col("labelStr"),
col("predictionStr")).count().show();
//Accuracy computation
MulticlassClassificationEvaluator evaluator =
new MulticlassClassificationEvaluator()
.setLabelCol("indLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy");
double accuracy = evaluator.evaluate(predictions);
System.out.println("Accuracy
= " + Math.round( accuracy * 100) + " %" );
// Keep the program running so we can
checkout things.
ExerciseUtils.hold();
268
-Nave Bayes
Bayes theorem gives the conditional probability of an event A
given event B has already occurred .Thus
P(A/B)=P(A intersect B) * P(A) / P(B)
Nave Bayes classification is the application of Bayes Theorem
.The target variable becomes the event A. The predictors
become the prior events (B1- Bn),We try to find P (A/B1-
Bn).The model generated stores the conditional probability of
the target of every possible value of the predictor. When a new
prediction need to be done ,the conditional probabilities are
applied using Bayes formula to find the probability .It is simple
and fast ,works well with missing and noisy data and provides a
high level of prediction since it provides the probabilities of the
result .It works very well with categorical data .It however
expects the predictors to be independent .(it makes a nave
assumption that the correlation coefficients of the inputs is low
).Its not good with large numeric variables .It is used in Medical
diagnosis ,spam filtering and document classification .The
snippet for Nave Bayes is given below :
/*-----------------------------------------------------------------
---------
Load Data
---------------------------------------------------------
-----------------*/
//Create the schema for the data to be loaded
into Dataset.
StructType smsSchema = DataTypes
269
.createStructType(new StructField[] {
DataTypes.createStructField("label",
DataTypes.DoubleType, false),
DataTypes.createStructField("message",
DataTypes.StringType, false)
});
Dataset<Row> smsDf = spSession.read()
.csv("data/SMSSpamCollection.csv");
smsDf.show(5);
smsDf.printSchema();
/*-------------------------------------------------------
-------------------
Cleanse Data
---------------------------------------------------------
-----------------*/
//Cleanse Data and create a label/message
dataframe.
//Change data frame back to RDD
JavaRDD<Row> rdd1 =
smsDf.toJavaRDD().repartition(2);
//Function to map.
270
JavaRDD<Row> rdd2 = rdd1.map( new
Function<Row, Row>() {
@Override
public Row call(Row iRow) throws Exception {
double spam = (iRow.getString(0).equals("spam") ? 1.0
: 0.0);
Row retRow = RowFactory.create( spam,
iRow.getString(1) );
return retRow;
});
//Create Data Frame back.
Dataset<Row> smsCleansedDf =
spSession.createDataFrame(rdd2, smsSchema);
System.out.println("Transformed Data :");
smsCleansedDf.show(5);
/*-------------------------------------------------------
-------------------
Prepare for Machine Learning - setup pipeline
---------------------------------------------------------
-----------------*/
271
// Split the data into training and test sets (30%
held out for testing).
Dataset<Row>[] splits =
smsCleansedDf.randomSplit(new double[]{0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
/*-------------------------------------------------------
-------------------
Perform machine learning. - Use the pipeline
---------------------------------------------------------
-----------------*/
Tokenizer tokenizer = new Tokenizer()
.setInputCol("message")
.setOutputCol("words");
HashingTF hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("rawFeatures");
IDF idf = new IDF()
.setInputCol("rawFeatures")
.setOutputCol("features");
272
NaiveBayes nbClassifier = new NaiveBayes()
.setLabelCol("label")
.setFeaturesCol("features");
Pipeline pipeline = new Pipeline()
.setStages(new
PipelineStage[]
{tokenizer,
hashingTF, idf, nbClassifier});
PipelineModel plModel = pipeline.fit(trainingData);
Dataset<Row> predictions = plModel.transform(testData);
predictions.show(5);
//View results
System.out.println("Result sample :");
predictions.show(5);
//View confusion matrix
System.out.println("Confusion Matrix :");
predictions.groupBy(col("label"),
col("prediction")).count().show();
//Accuracy computation
273
MulticlassClassificationEvaluator evaluator =
new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy");
double accuracy =
evaluator.evaluate(predictions);
System.out.println("Accuracy
= " + Math.round( accuracy * 100) + " %" );
// Keep the program running so we can
checkout things.
ExerciseUtils.hold();
-K-means Clustering
It IS an unsupervised learning technique in which data is
grouped into subsets based on similarity.(similarity of predictor
variables).We partition n observations with m variables into k
clusters where by each observation belongs to only one cluster
.A m dimensional space is created and each observation is
plotted based on the variable values on this space .Clustering is
done by measuring the distance between the points and
grouping them .Multiple types of distance measures are
274
available like Euclidian distance and Manhattan distance .The
below diagram show the stages of clustering .
Fig 8.13 Stages of clustering
Advantages of k-means clustering are that it is fast ,efficient
with large number of variables, and explainable .Disadvantage
is we need to know the k upfront .The initial centroid position
do influence the formation of clusters .It is used in preliminary
grouping of data and general grouping .The below snippet
shows K-means clustering using Spark .
275
/*---------------------------------------------------------
-----------------
Load Data
---------------------------------------------------------
-----------------*/
Dataset<Row> autoDf = spSession.read()
.option("header","true")
.csv("data/auto-data.csv");
autoDf.show(5);
autoDf.printSchema();
/*-------------------------------------------------------
-------------------
Cleanse Data convert data type
---------------------------------------------------------
-----------------*/
//Create the schema for the data to be loaded
into Dataset.
StructType autoSchema = DataTypes
.createStructType(new
StructField[] {
DataTypes.createStructField("DOORS",
DataTypes.DoubleType, false),
DataTypes.createStructField("BODY",
DataTypes.DoubleType, false),
DataTypes.createStructField("HP",
DataTypes.DoubleType, false),
276
DataTypes.createStructField("RPM",
DataTypes.DoubleType, false),
DataTypes.createStructField("MPG",
DataTypes.DoubleType, false)
});
JavaRDD<Row> rdd1 = autoDf.toJavaRDD().repartition(2);
//Function to map.
JavaRDD<Row> rdd2 = rdd1.map( new Function<Row,
Row>() {
@Override
public Row call(Row iRow) throws Exception {
double doors = ( iRow.getString(3).equals("two") ? 1.0 : 2.0);
double body = ( iRow.getString(4).equals("sedan") ? 1.0 : 2.0);
Row retRow = RowFactory.create( doors, body,
Double.valueOf(iRow.getString(7)),
Double.valueOf(iRow.getString(8)),
Double.valueOf(iRow.getString(9)) );
return retRow;
}
});
//Create Data Frame back.
Dataset<Row> autoCleansedDf =
spSession.createDataFrame(rdd2, autoSchema);
System.out.println("Transformed Data :");
277
autoCleansedDf.show(5);
/*-------------------------------------------------------
-------------------
Prepare for Machine Learning - Perform
Centering and Scaling
---------------------------------------------------------
-----------------*/
Row meanRow =
autoCleansedDf.agg(avg(autoCleansedDf.col("DOORS")),
avg(autoCleansedDf.col("BODY")),
avg(autoCleansedDf.col("HP")),
avg(autoCleansedDf.col("RPM")),
avg(autoCleansedDf.col("MPG")))
.toJavaRDD().takeOrdered(1).get(0) ;
Row stdRow =
autoCleansedDf.agg(stddev(autoCleansedDf.col("DOORS")),
stddev(autoCleansedDf.col("BODY")),
stddev(autoCleansedDf.col("HP")),
stddev(autoCleansedDf.col("RPM")),
stddev(autoCleansedDf.col("MPG")))
.toJavaRDD().takeOrdered(1).get(0) ;
System.out.println("Mean Values : " +
meanRow);
278
System.out.println("Std Dev Values : " +
stdRow);
Broadcast<Row> bcMeanRow =
spContext.broadcast(meanRow);
Broadcast<Row> bcStdRow =
spContext.broadcast(stdRow);
DoubleAccumulator rowId =
spContext.sc().doubleAccumulator();
rowId.setValue(1);
//Perform center-and-scale and create a vector
JavaRDD<Row> rdd3 =
autoCleansedDf.toJavaRDD().repartition(2);
JavaRDD<LabeledPoint> rdd4 = rdd3.map(
new Function<Row, LabeledPoint>() {
@Override
public LabeledPoint call(Row iRow) throws Exception {
double doors = (bcMeanRow.value().getDouble(0) -
iRow.getDouble(0))
/ bcStdRow.value().getDouble(0);
double body =
(bcMeanRow.value().getDouble(1) - iRow.getDouble(1))
/ bcStdRow.value().getDouble(1);
double hp = (bcMeanRow.value().getDouble(2) -
iRow.getDouble(2))
/ bcStdRow.value().getDouble(2);
279
double rpm = (bcMeanRow.value().getDouble(3) -
iRow.getDouble(3))
/ bcStdRow.value().getDouble(3);
double mpg = (bcMeanRow.value().getDouble(4) -
iRow.getDouble(4))
/ bcStdRow.value().getDouble(4);
double id= rowId.value();
rowId.setValue(rowId.value()+1);
LabeledPoint lp = new LabeledPoint( id,
Vectors.dense( doors, body, hp, rpm, mpg));
return lp;
}
});
Dataset<Row> autoVector =
spSession.createDataFrame(rdd4, LabeledPoint.class );
System.out.println("Centered and scaled vector
:" + autoVector.count());
autoVector.show(5);
/*-------------------------------------------------------
-------------------
Perform Machine Learning
---------------------------------------------------------
-----------------*/
280
KMeans kmeans = new KMeans()
.setK(4)
.setSeed(1L);
KMeansModel model =
kmeans.fit(autoVector);
Dataset<Row> predictions =
model.transform(autoVector);
System.out.println("Groupings : ");
predictions.show(5);
System.out.println("Groupings Summary : ");
predictions.groupBy(col("prediction")).count().show();
// Keep the program running so we can
checkout things.
ExerciseUtils.hold();
}
281
CHAPTER 9 - GENETIC ALGORITHMS USING
JAVA
Getting started
Digital computers and the rise of the information age have
revolutionized the modern lifestyle. The invention of digital
computers has enabled us to digitize numerous areas of our
lives. This digitalization allows us to outsource many tedious
daily tasks to computers where previously humans may have
been required. An everyday example of this would be modern
word processing applications that feature built in spell checkers
to automatically check documents for spelling and grammar
mistakes.
As computers have grown faster and more computationally
powerful, we have been able to use them to perform
increasingly complex tasks such as understanding human
speech and even somewhat accurately predict the weather. This
constant innovation allows us to outsource a growing number of
tasks to computers. A present day computer is likely able to
execute billions of operations a second, but however technically
capable they become, unless they can learn and adapt
themselves to better suit the problems presented to them, theyll
always be limited to whatever rules or code us humans write for
them.
The field of artificial intelligence and the subset of genetic
algorithms are beginning to tackle some of these more complex
problems faced in todays digital world. By implementing
genetic algorithms into real world applications it is possible to
solve problems which would be nearly impossible to solve by
more traditional computing methods.
282
Artificial Intelligence
Many early computer scientists believed computers would not
only be able to demonstrate intelligent-like behavior, but that
they would achieve human level intelligence in just a few
decades of research. This notion is indicated by Herbert A.
Simon in 1965 when he declared, Machines will be capable,
within twenty years, of doing any work a man can do. Of
course now, over 50 years later, we know that Simons
prediction was far from reality, but at the time many computer
scientists agreed with his position and made it their goal to
create a strong AI machine. A strong AI machine is simply a
machine which is at least just as intellectually capable at
completing any task its given as humans.
When early computer scientists were first trying to build
artificially intelligent systems, they would frequently look to
nature for inspiration on how their algorithms could work. By
creating models which mimic processes found in nature,
computer scientists were able to give their algorithms the ability
to evolve, and even replicate characteristics of the human brain.
It was implementing their biologically-inspired algorithms that
enabled these early pioneers, for the first time, to give their
machines the ability to adapt, learn and control aspects of their
environments.
By using different biological analogies as a guiding
metaphor to develop artificially intelligent systems, computer
scientists created distinct fields of research. Naturally, the
different biological systems that inspired each field of research
have their own specific advantages and applications. One
successful field, and the one were paying attention to in this
book, is evolutionary computation - in which genetic algorithms
make up the majority of the research. Other fields focused on
283
slightly different areas, such as modeling the human brain. This
field of research is called artificial neural networks, and it uses
models of the biological nervous system to mimic its learning
and data processing capabilities.
Biological Evolution
Biological evolution, through the process of natural
selection, was first proposed by Charles Darwin (1859) in his
book, The Origin of Species. It was his concept of biological
evolution which inspired early computer scientists to adapt and
use biological evolution as a model for their optimization
techniques, found in evolutionary computation algorithms.
Because many of the ideas and concepts used in genetic
algorithms stem directly from biological evolution, a basic
familiarity with the subject is useful for a deeper understanding
into the field. With that being said, before we begin exploring
genetic algorithms, lets first run through the (somewhat
simplified) basics of biological evolution.
All organisms contain DNA which encodes all of the
different traits that make up that organism. DNA can be thought
of as lifes instruction manual to create the organism from
scratch. Changing the DNA of an organism will change its traits
such as eye and hair color. DNA is made up of individual genes,
and it is these genes that are responsible for encoding the
specific traits of an organism.
An organisms genes are grouped together in chromosomes
and a complete set of chromosomes make up an organisms
genome. All organisms will have a least one chromosome, but
usually contain many more, for example humans have 46
chromosomes with some species, having more than 1000! In
genetic algorithms we usually refer to the chromosome as the
284
candidate solution. This is because genetic algorithms typically
use a single chromosome to encode the candidate solution.
The various possible settings for a specific trait are called
an allele, and the position in the chromosome where that trait
is encoded is called a locus. We refer to a specific genome as
a genotype and the physical organism that genotype encodes
is called the phenotype.
When two organisms mate, DNA from both organisms are
brought together and combined in such a way that the resulting
organism usually referred to as the offspring acquires 50%
of its DNA from its first parent, and the other 50% from the
second. Every so often a gene from the organisms DNA will
mutate providing it with DNA found in neither of its parents.
These mutations provide the population with genetic diversity
by adding genes to the population that werent available
beforehand. All possible genetic information in the population
is referred as the populations gene pool.
If the resulting organism is fit enough to survive in its
environment its likely to mate itself, allowing its DNA to
continue on into future populations. If however, the resulting
organism isnt fit enough to survive and eventually mate its
genetic material wont propagate into future populations. This is
why evolution is occasionally referred to as survival of the
fittest only the fittest individuals survive and pass on their
DNA. Its this selective pressure that slowly guides evolution to
find increasingly fitter and better adapted individuals.
Terminology
Population - This is simply just a collection of candidate
solutions which can have genetic operators such as mutation
and crossover applied to them.
285
Candidate Solution A possible solution to a given problem.
Gene The indivisible building blocks making up the
chromosome. Classically a gene consists of 0 or a 1.
Chromosome A chromosome is a string of genes. A
chromosome defines a specific candidate solution. A typical
chromosome with a binary encoding might contain
something like, 01101011.
Mutation The process in which genes in a candidate
solution are randomly altered to create new traits.
Crossover The process in which chromosomes are
combined to create a new candidate solution. This is
sometimes referred to as recombination.
Selection This is the technique of picking candidate
solutions to breed the next generation of solutions.
Fitness A score which measures the extent to which a
candidate solution is adapted to suit a given problem.
Search Spaces
In computer science when dealing with optimization
problems that have many candidate solutions which need to be
searched through, we refer to the collection of solutions as a
search space. Each specific point within the search space
serves as a candidate solution for the given problem. Within this
search space there is a concept of distance where solutions that
are placed closer to one another are more likely to express
similar traits than solutions place further apart. To understand
how these distances are organized on the search space, consider
the following example using a binary genetic representation:
286
101 is only 1 difference away from, 111. This is
because there is only 1 change required (flipping the 0 to 1) to
transition from 101 to 111. This means these solutions are
only 1 space apart on the search space.
000 on the other hand, is three differences away from,
111. This gives it a distance of 3, placing 000 3 spaces from
111 on the search space.
Because solutions with fewer changes are grouped nearer to
one another, the distance between solutions on the search space
can be used to provide an approximation of the characteristics
held by another solution. This understanding is often used as a
tactic by many search algorithms to improve their search
results.
Fitness Landscapes
When candidate solutions found within the search space are
labeled by their individual fitness levels we can begin to think
of the search space as a fitness landscape.
Fig 9.1 Fitness landscape
On the bottom axis of our fitness landscape is the value
were optimizing for, and on the left axis is its corresponding
fitness value. I should note, this is typically an over
287
simplification of what would be found in practice. Most real
world applications have multiple values that need optimizing
creating a multi-dimensional fitness landscape.
In the above example the fitness value for every candidate
solution in the search space can be seen. This makes it easy to
see where the fittest solution is located, however, for this to be
possible in reality, each candidate solution in the search space
would have needed to have their fitness function evaluated. For
complex problems with exponential search spaces it just isnt
plausible to evaluate every solutions fitness value. In these
cases, it is the search algorithms job to find where the best
solution likely resides while being limited to only having a tiny
proportion of the search space visible.
Fig 9.2 Search Algorithm view
Consider an algorithm that is searching through a search
space of one billion (1,000,000,000) possible solutions. Even if
each solution only takes 1 second to evaluate and be assigned a
fitness value, it would still take over 30 years to explicitly
search through each potential solution! If we dont know the
fitness value for each solution in the search space then we are
unable to definitively know where the best solution resides. In
this case, the only reasonable approach is to use a search
algorithm capable of finding a good-enough, solution in the
288
time frame available. In these conditions, genetic algorithms
and evolutionary algorithms in general, are very effective at
finding feasible, near optimum solutions in a relatively short
time frame.
Genetic algorithms use a population approach when
searching the search space. As part of their search strategy
genetic algorithms will assume two well ranking solutions can
be combined to form an even fitter offspring. This process can
be visualized on our fitness landscape .
Fig 9.3 Offspring and Parents
The mutation operator found in genetic algorithms allows
us to search the close neighbors of the specific candidate
solution. When mutation is applied to a gene its value is
randomly changed. This can be pictured by taking a single step
on the search space.
289
Fig 9.4 Mutated Solution
In the example of both crossover and mutation it is possible
to end up with a solution less fit than what we originally set out
with .
Fig 9.5 Crossover Mutation
In these circumstances, if the solution performs poorly
enough, it will eventually be removed from the gene pool
during the selection process. Small negative changes in
individual candidate solutions are fine as long as the
populations average trend tends towards fitter solutions.
290
Local Optimums
An obstacle that should be considered when implementing
an optimization algorithm is how well the algorithm can escape
from locally optimal positions in the search space.
Fig 9.5 Local Optimum
Here we can see two hills on the fitness landscape which
have peaks of slightly different heights. As mentioned earlier,
the optimization algorithm isnt able to see the entire fitness
landscape, and instead, the best it can do is find solutions which
it believes are likely to be in an optimal position on the search
space. Its because of this characteristic the optimization
algorithm can often unknowingly focus its search on suboptimal
portions of the search space.
This problem becomes quickly noticeable when
implementing a simple hill climbing algorithm to solve
problems of any sufficient complexity. A simple hill climber
doesnt have any inherent method to deal with local optimums,
and as a result will often terminate its search in locally optimal
regions of the search space. A simple stochastic hill climber is
comparable to a genetic algorithm without a population and
crossover. The algorithm is fairly easy to understand, it starts
291
off at a random point in the search space, then attempts to find a
better solution by evaluating its neighbor solutions. When the
hill climber finds a better solution amongst its neighbors, it will
move to the new position and restart the search process again.
This process will gradually find improved solutions by taking
steps up whatever hill it found itself on in the search space
hence the name, hill climber. When the hill climber can no
longer find a better solution it will assume it is at the top of the
hill and stop the search.
Figure illustrates how a typical run-through of a hill climber
algorithm might look.
Fig 9.6 Hill Climber Algorithm
The diagram above demonstrates how a simple hill climber
algorithm can easily return a locally optimal solution if its
search begins in a locally optimal area of the search space.
Although there isnt any guaranteed way to avoid local
optimums without first evaluating the entire search area, there
are many variations of the algorithm which can help avoid local
optimums. One of the most basic and effective methods is
called random-restart hill climbing, which simply runs the hill
climbing algorithm multiple times from random starting
positions then returns the best solution found from its various
292
runs. This optimization method is relatively easy to implement
and surprisingly effective. Other approaches such as, Simulated
Annealing (see Kirkpatrick, Gelatt, and Vecchi (1983)) and
Tabu search (see Glover (1989) and Glover (1990)) are slight
variations to the hill climbing algorithm which both having
properties that can help reduce local optimums.
Genetic algorithms are surprisingly effective at avoiding
local optimums and retrieving solutions that are close to
optimal. One of the ways it achieves this is by having a
population that allows it to sample a large area of the search
space locating the best areas to continue the search.
Fig 9.6 Genetic Algorithm Process I
After a few generations have past, the population will begin
to conform towards where the best solutions could be found in
the previous generations. This is because less fit solutions will
be removed during the selection process making way for new,
fitter, solutions to be made during crossover and mutation.
293
Fig 9.7 Genetic Algorithm Process II
The mutation operator also plays a role in evading local
optimums. Mutation allows a solution to jump from its current
position to another position on the search space. This process
will often lead to the discovery of fitter solutions in more
optimal areas in the search space.
Mutation Rate
The mutation rate is the probability in which a specific gene
in a solutions chromosome will be mutated. There is
technically no correct value for the mutation rate of a genetic
algorithm, but some mutation rates will provide vastly better
results than others. A higher mutation rate allows for more
genetic diversity in the population and can also help the
algorithm avoid local optimums. However, a mutation rate
thats too high can cause too much genetic variation between
each generation causing it to lose good solutions it found in its
previous population.
If the mutation rate is too low, the algorithm can take an
unreasonably long time to move along the search space
hindering its ability to find a satisfactory solution. A mutation
rate thats too high can also prolong the time it takes to find an
acceptable solution. Although, a high mutation rate can help the
294
genetic algorithm avoid getting stuck in local optimums, when
its set too high it can have a negative impact on the search.
This, as was said before, is due to the solutions in each
generation being mutated to such a large extent that theyre
practically randomized after mutation has been applied.
To understand why a well configured mutation rate is
important, consider two binary encoded candidate solutions,
100 and 101. Without mutation new solutions can only
come from crossover. However, when we crossover our
solutions there are only two possible outcomes available for the
offspring, 100 or 101. This is because the only difference in
the parents genomes can be found in their last bits. If the
offspring receives its last bit from the first parent, it will be a
1, otherwise if its from the second, it would be a 0. If the
algorithm needed to find an alternative solution it would need to
mutate an existing solution, giving it new genetic information
that isnt available elsewhere in the gene pool.
The mutation rate should be set to a value that allows for
enough diversity to prevent the algorithm plateauing, but not so
much that it causes the algorithm to lose valuable genetic
information from the previous population. This balance will
depend on the nature of the problem being solved.
Population Size
The population size is simply the number of individuals in
the genetic algorithms population in any one generation. The
larger the populations size, the more of the search space the
algorithm can sample. This will help lead it in the direction of
more accurate, and globally optimal, solutions. A small
population size will often result in the algorithm finding less
desirable solutions in locally optimal areas of the search space,
295
however they require less computational resources per
generation.
Again here, like with the mutation rate, a balance needs to
be found for optimum performance of the genetic algorithm.
Likewise, the population size required will change depending
on the nature of the problem being solved. Large hilly search
spaces commonly require a larger population size to find the
best solutions. Interestingly, when picking a population size
there is a point in which increasing the size will cease to
provide the algorithm with much improvement in the accuracy
of the solutions it finds. Instead, it will slow the execution down
due to the extra computational demand needed to process the
additional individuals. A population size around this transition
is usually going to provide the best balance between resources
and results.
Crossover Rate
The frequency in which crossover is applied also has an
effect on the overall performance of the genetic
algorithm. Changing the crossover rate adjusts the chance in
which solutions in the population will have the crossover
operator applied to them. A high rate allows for many new,
potentially superior, solutions to be found during the crossover
phase. A lower rate will help keep the genetic information from
fitter individuals intact for the next generation. The crossover
rate should usually be set to a reasonably high rate promoting
the search for new solutions while allowing a small percentage
of the population to be kept unaffected for the next generation.
Genetic Representations
Aside from the parameters, another component that can
affect a genetic algorithms performance is the genetic
296
representation used. This is the way the genetic information is
encoded within the chromosomes. Better representations will
encode the solution in a way that is expressive while also being
easily evolvable. Hollands (1975) genetic algorithm was based
on a binary genetic representation. He proposed using
chromosomes that were comprised of strings containing 0s and
1s. This binary representation is probably the simplest encoding
available, however for many problems it isnt quite expressive
enough to be a suitable first choice. Consider the example in
which a binary representation is used to encode an integer
which is being optimized for use in some function. In this
example, 000 represents 0, and 111 represents 7, as it
typically would in binary. If the first gene in the chromosome is
mutated - by flipping the bit from 0 to 1, or from 1 to 0 - it
would change the encoded value by 4 (111 = 7, 011 = 3).
However, if the final gene in the chromosome is changed it will
only effect the encoded value by 1 (111 = 7, 110 = 6). Here
the mutation operator has a different effect on the candidate
solution depending on which gene in its chromosome is being
operated on. This disparity isnt ideal as it will reduce
performance and predictability of the algorithm. For this
example, it would have been better to use an integer with a
complimentary mutation operator which could add or subtract a
relatively small amount to the genes value.
The Search Process
To finish the chapter lets take a step-by-step look at the
basic process behind a genetic algorithm.
297
Fig 9.8 Genetic Algorithm Flow
1. Genetic algorithms begin by initializing a population of
candidate solutions. This is typically done randomly to
provide an even coverage of the entire search space.
2. Next, the population is evaluated by assigning a fitness value
to each individual in the population. In this stage we would
often want to take note of the current fittest solution, and the
average fitness of the population.
3. After evaluation, the algorithm decides whether it should
terminate the search depending on the termination conditions
298
set. Usually this will be because the algorithm has reached a
fixed number of generations or an adequate solution has
been found.
4. If the termination condition is not met, the population goes
through a selection stage in which individuals from the
population are selected based on their fitness score the
higher the fitness, the better chance an individual has of
being selected.
5. The next stage is to apply crossover and mutation to the
selected individuals. This stage is where new individuals are
created for the next generation.
6. At this point the new population goes back to the evaluation
step and the process starts again. We call each cycle of this
loop a generation.
7. When the termination condition is finally met, the algorithm
will break out of the loop and typically return its finial
search results back to the user.
Pseudo Code and Implementation for a Basic Genetic
Algorithm
The pseudo code for a basic genetic algorithm is as follows:
1:generation= 0;
2:population[generation]= initializePopulation(populationSize);
3:evaluatePopulation(population[generation]);
3:WhileisTerminationConditionMet() == falsedo
4: parents= selectParents(population[generation]);
5: population[generation+1] = crossover(parents);
6:population[generation+1]= mutate(population[generation
+1]);
299
7: evaluatePopulation(population[generation]);
8: generation++;
9: End loop;
The pseudo code begins with creating the genetic algorithms
initial population. This population is then evaluated to find the
fitness values of its individuals. Next, a check is run to decide if
the genetic algorithms termination condition has been met. If it
hasnt, the genetic algorithm begins looping and the population
goes through its first round of crossover and mutation before
finally being reevaluated. From here, crossover and mutation
are continuously applied until the termination condition is met,
and the genetic algorithm terminates.
This pseudo code demonstrates the basic process of a genetic
algorithm; however it is necessary that we look at each step in
more detail to fully understand how to create a satisfactory
genetic algorithm.
The first thing were going to do is set up the genetic algorithm
parameters. As covered previously, the three primary
parameters are population size, mutation rate and crossover rate.
We also introduce a concept called elitism in this chapter, and
will include that as one of the parameters of the genetic
algorithm.
To begin, create a class called GeneticAlgorithm. If youre
using Eclipse, you can do this by selecting File New
Class.
This GeneticAlgorithm class will contain the methods and
variables needed for operations of the genetic algorithm itself.
For example, this class includes the logic to handle crossover,
mutation, fitness evaluation, and termination condition
300
checking. After the class has been created, add a constructor
which accepts the four parameters: population size, mutation
rate, crossover rate, and number of elite members.
packagetest;
/**
* Lots of comments in the source that are omitted here!
*/
public class GeneticAlgorithm {
private int populationSize;
private double mutationRate;
private double crossoverRate;
private int elitismCount;
public GeneticAlgorithm(int populationSize, double
mutationRate, double crossoverRate, int elitismCount) {
this.populationSize = populationSize;
this.mutationRate = mutationRate;
this.crossoverRate = crossoverRate;
this.elitismCount = elitismCount;
}
/**
* Many more methods implemented later...
*/
}
When passed the required parameters, this constructor will
create a new instance of the GeneticAlgorithm class with the
required configuration.
Now we should create our bootstrap class recall that each
chapter will require a bootstrap class to initialize the genetic
301
algorithm and provide a starting point for the application. Name
this class AllOnesGA and define a main method:
packagetest;
public class AllOnesGA {
public static void main(String[] args) {
// Create GA object
GeneticAlgorithm ga = new GeneticAlgorithm(100,
0.01, 0.95, 0);
// Well add a lot more here...
}
}
For the time being, well just use some typical values for
the parameters, population size = 100; mutation rate = 0.01;
crossover rate = 0.95, and an elitism count of 0 (effectively
disabling it for now). After you have completed your
implementation at the end of the chapter, you can experiment
with how changing these parameters affect the performance of
the algorithm.
Our next step is to initialize a population of potential
solutions. This is usually done randomly, but occasionally it
might be preferable to initialize the population more
systematically, possibly to make use of known information
about the search space. In this example, each individual in the
population will be initialized randomly. We can do this by
selecting a value of 1 or 0 for each gene in a chromosome at
random.
Before initializing the population we need to create two
classes, one to manage and create the population and the other
to manage and create the populations individuals. It will be
these classes that contain the methods to fetch an individuals
302
fitness, or get the fittest individual in the population, for
example.
First lets start by creating the Individual class. Note that
weve omitted all the comments and method docblocks below
to save paper! You can find a thoroughly annotated version of
this class in the accompanying Eclipse project.
package test;
public class Individual {
private int[] chromosome;
private double fitness = -1;
public Individual(int[] chromosome) {
// Create individual chromosome
this.chromosome = chromosome;
}
public Individual(int chromosomeLength) {
this.chromosome = new int[chromosomeLength];
for (int gene = 0; gene < chromosomeLength; gene++) {
if (0.5 < Math.random()) {
this.setGene(gene, 1);
} else {
this.setGene(gene, 0);
}
}
public int[] getChromosome() {
return this.chromosome;
}
303
public int getChromosomeLength() {
return this.chromosome.length;
}
public void setGene(int offset, int gene) {
this.chromosome[offset] = gene;
}
public int getGene(int offset) {
return this.chromosome[offset];
}
public void setFitness(double fitness) {
this.fitness = fitness;
}
public double getFitness() {
return this.fitness;
}
public String toString() {
String output = "";
for (int gene = 0; gene < this.chromosome.length;
gene++) {
output += this.chromosome[gene];
}
return output;
}
}
The Individual class represents a single candidate solution
and is primarily responsible for storing and manipulating a
chromosome. Note that the Individual class also has two
304
constructors. One constructor accepts an integer (representing
the length of the chromosome) and will create a random
chromosome when initializing the object. The other constructor
accepts an integer array and uses that as the chromosome.
Aside from managing the Individuals chromosome it also
keeps track of the individuals fitness value, and also knows
how to print itself as a string.
The next step is to create the Population class which
provides the functionality needed to manage a group of
individuals in a population.
As usual, comments and docblocks have been omitted from
this chapter; be sure to look at the Eclipse project for more
context!
Package test;
import java.util.Arrays;
import java.util.Comparator;
public class Population {
private Individual population[];
private double populationFitness = -1;
public Population(int populationSize) {
this.population = new Individual[populationSize];
}
public Population(int populationSize, int
chromosomeLength) {
this.population = new Individual[populationSize];
305
for (int individualCount = 0; individualCount <
populationSize; individualCount++) {
Individual individual = new
Individual(chromosomeLength);
this.population[individualCount] = individual;
}
}
public Individual[] getIndividuals() {
return this.population;
}
public Individual getFittest(int offset) {
Arrays.sort(this.population, new
Comparator<Individual>() {
@Override
public int compare(Individual o1, Individual o2) {
if (o1.getFitness() > o2.getFitness()) {
return -1;
} else if (o1.getFitness() < o2.getFitness()) {
return 1;
}
return 0;
}
});
return this.population[offset];
}
public void setPopulationFitness(double fitness) {
this.populationFitness = fitness;
}
306
public double getPopulationFitness() {
return this.populationFitness;
}
public int size() {
return this.population.length;
}
public Individual setIndividual(int offset, Individual
individual) {
return population[offset] = individual;
}
public Individual getIndividual(int offset) {
return population[offset];
}
public void shuffle() {
Random rnd = new Random();
for (int i = population.length - 1; i > 0; i--) {
int index = rnd.nextInt(i + 1);
Individual a = population[index];
population[index] = population[i];
population[i] = a;
}
}
}
The population class is quite simple; its main function is to
hold an array of individuals which can be accessed as needed in
a convenient way by the class methods. Methods such as the
getFittest( ) and setIndividual( ) are examples of methods which
can access and update the individuals in the population. In
307
addition to holding the individuals it also stores the populations
total fitness which will become important later on when
implementing the selection method.
Now that we have our population and individual classes, we
can implement them in the GeneticAlgorithm class. To do so,
simply create a method named initPopulatio anywhere in the
GeneticAlgorithm class.
public class GeneticAlgorithm {
/**
* The constructor we created earlier is up here...
*/
public Population initPopulation(int chromosomeLength) {
Population population = new
Population(this.populationSize, chromosomeLength);
return population;
}
/**
* We still have lots of methods to implement down here...
*/
}
Now that we have a Population and an Individual class, we
can return to our AllOnesGA class and start working with the
initPopulation method. Recall that the AllOnesGA class
only has a main method, and that it represents the pseudocode
presented earlier in the chapter.
When initializing the population in the main method, we
also need to specify the length of individuals chromosomes
here we are going to use a length of 50:
308
public class AllOnesGA {
public static void main(String[] args){
// Create GA object
GeneticAlgorithm ga = new GeneticAlgorithm(100, 0.01,
0.95, 0);
// Initialize population
Population population = ga.initPopulation(50);
}
}
In the evaluation stage, each individual in the population
has their fitness value calculated and stored for future use. To
calculate the fitness of an individual we use a function known
as the fitness function.
Genetic algorithms operate by using selection to guide the
evolutionary process towards better individuals. Because its the
fitness function that makes this selection possible, its important
that the fitness function is designed well and provides an
accurate value for the individuals fitness. If the fitness function
isnt well-designed it can take longer to find a solution which
satisfies the minimum criteria, or possibly, fail to find an
acceptable solution at all.
Fitness functions will often be the most computationally
demanding components of a genetic algorithm. Because of this,
its important that the fitness function is also well optimized
helping to prevent bottlenecks and allowing the algorithm to run
efficiently.
Each specific optimization problem requires a uniquely
developed fitness function. In our example of the all-ones
problem, the fitness function is rather straightforward, simply
309
counting the number of ones found within an individuals
chromosome.
Now add a calcFitness method to the GeneticAlgorithm
class. This method should count the number of ones in the
chromosome, and then normalize the output to be between zero
and one by dividing by the chromosome length. You may add
this method anywhere in the GeneticAlgorithm class, so weve
omitted the surrounding code below:
public double calcFitness(Individual individual) {
// Track number of correct genes
int correctGenes = 0;
// Loop over individuals genes
for (int geneIndex = 0; geneIndex <
individual.getChromosomeLength(); geneIndex++) {
// Add one fitness point for each "1" found
if (individual.getGene(geneIndex) == 1) {
correctGenes += 1;
}
}
// Calculate fitness
double fitness = (double) correctGenes
/ individual.getChromosomeLength();
// Store fitness
individual.setFitness(fitness);
return fitness;
}
310
We also need a simple helper method to loop over every
individual in the population and evaluate them (i.e., call
calcFitness on each individual). Lets call this method
evalPopulation and add it to the GeneticAlgorithm class as well.
It should look like the following, and again you may add it
anywhere:
public void evalPopulation(Population population) {
double populationFitness = 0;
for (Individual individual : population.getIndividuals()) {
populationFitness += calcFitness(individual);
}
population.setPopulationFitness(populationFitness);
}
At this point, the GeneticAlgorithm class should have the
following methods in it. For brevitys sake, weve omitted the
body of the functions and are just showing a collapsed view of
the class:
package chapter2;
public class GeneticAlgorithm {
private int populationSize;
private double mutationRate;
private double crossoverRate;
private int elitismCount;
public GeneticAlgorithm(int populationSize, double
mutationRate, double crossoverRate, int elitismCount) { }
public Population initPopulation(int chromosomeLength)
{}
311
public double calcFitness(Individual individual) { }
public void evalPopulation(Population population) { }
}
If youre missing any of these properties or methods, please
go back and implement them now. We have four more methods
to implement in the GeneticAlgorithm class:
isTerminationConditionMet, selectParent, crossoverPopulation,
and mutatePopulation.
The next thing needed is to check if our termination
condition has been met. There are many different types of
termination conditions. Sometimes, its possible to know what
the optimal solution is (rather, its possible to know the fitness
value of the optimal solution), in which case we can check
directly for the correct solution. However, its not always
possible to know what the fitness of the best solution is, so we
can terminate when the solution has become good enough;
that is, whenever the solution exceeds some fitness threshold.
We could also terminate when the algorithm has run for too
long (too many generations), or we can combine a number of
factors when deciding to terminate the algorithm.
Due to the simplicity of the all-ones problem, and the fact
that we know the correct fitness should be 1, its reasonable in
this case to terminate when the correct solution has been found.
This wont always be the case! In fact, it will only rarely be the
case but were lucky this is an easy problem.
To begin, we must first construct a function which can
check if our termination condition has occurred. We can do this
by adding the following code to the GeneticAlgorithm class.
Add this anywhere, and as usual weve omitted the surrounding
class for brevitys sake.
312
public boolean isTerminationConditionMet(Population
population) {
for (Individual individual : population.getIndividuals()) {
if (individual.getFitness() == 1) {
return true;
}
}
return false;
}
The above method checks each individual in the population
and will return true indicating that weve found a termination
condition and can stop if the fitness of any individual in the
population is 1.
Now that the termination condition has been built, a loop
can be added to the main bootstrap method in the AllOnesGA
class using the newly added termination check as its loop
conditional. When the termination check returns true, the
genetic algorithm will stop looping and return its results.
To create the evolution loop, modify the main method of
our executive AllOnesGA class to represent the following. The
first two lines of the snippet below are already in the main
method. By adding this code were continuing to implement the
pseudocode presented at the beginning of the chapter recall
that the main method is a concrete representation of the
genetic algorithm pseudocode. Heres what the main method
should look like now:
public static void main(String[] args) {
// These two lines were already here:
GeneticAlgorithm ga = new GeneticAlgorithm(100, 0.001,
313
0.95, 0);
Population population = ga.initPopulation(50);
// The following is the new code you should be adding:
ga.evalPopulation(population);
int generation = 1;
while (ga.isTerminationConditionMet(population) == false)
{
// Print fittest individual from population
System.out.println("Best solution: " +
population.getFittest(0).toString());
// Apply crossover
// TODO!
// Apply mutation
// TODO!
// Evaluate population
ga.evalPopulation(population);
// Increment the current generation
generation++;
}
System.out.println("Found solution in " + generation + "
generations");
System.out.println("Best solution: " +
population.getFittest(0).toString());
314
In addition to the various selection methods that can be
used during crossover, there are also different methods to
exchange the genetic information between two individuals.
Different problems have slightly different properties and work
better with specific crossover methods. For example, the all-
ones problem simply requires a string that consists entirely of
1s. A string of 00111 has the same fitness value as a string of
10101 they both contain three 1s. With genetic algorithms
of this type, this isnt always the case. Imagine we are trying to
create a string which lists, in order, the numbers one to five. In
this case the string 12345 has a very different fitness value
from 52431. This is because were not just looking for the
correct numbers, but also the correct order. For problems such
as this, a crossover method that respects the order of the genes
is preferable.
The crossover method we will be implementing here
is uniform crossover. In this method each gene of the offspring
has a 50% change of coming from either its first parent or its
second parent.
Fig 9.9 Crossover
To implement roulette wheel selection, add a selectParent( )
method anywhere in the GeneticAlgorithm class.
315
public Individual selectParent(Population population) {
// Get individuals
Individual individuals[] = population.getIndividuals();
// Spin roulette wheel
double populationFitness
= population.getPopulationFitness();
double rouletteWheelPosition = Math.random()
* populationFitness;
// Find parent
double spinWheel = 0;
for (Individual individual : individuals) {
spinWheel += individual.getFitness();
if (spinWheel >= rouletteWheelPosition) {
return individual;
}
}
return individuals[population.size() - 1];
}
The selectParent( ) method essentially runs a roulette wheel
in reverse; at a casino, the wheel already has markings on it, and
then you spin the wheel and wait for the ball to drop into
position. Here, however, we select a random position first and
then work backward to figure out which individual lies in that
position. Algorithmically, its easier this way. Choose a random
number between 0 and the total population fitness, then loop
through each individual, summing their fitnesses as you go,
until you reach the random position you chose at the outset.
Now that the selection method has been added, the next step
is to create the crossover method using this selectParent( )
316
method to select the crossover mates. To begin, add the
following crossover method to the GeneticAlgorithm class.
public Population crossoverPopulation(Population population) {
// Create new population
Population newPopulation = new
Population(population.size());
// Loop over current population by fitness
for (int populationIndex = 0; populationIndex <
population.size(); populationIndex++) {
Individual parent1
= population.getFittest(populationIndex);
// Apply crossover to this individual?
if (this.crossoverRate > Math.random() &&
populationIndex > this.elitismCount) {
// Initialize offspring
Individual offspring = new
Individual(parent1.getChromosomeLength());
// Find second parent
Individual parent2 = selectParent(population);
// Loop over genome
for (int geneIndex = 0; geneIndex <
parent1.getChromosomeLength(); geneIndex++) {
// Use half of parent1's genes and half of
parent2's genes
if (0.5 > Math.random()) {
offspring.setGene(geneIndex,
parent1.getGene(geneIndex));
} else {
317
offspring.setGene(geneIndex,
parent2.getGene(geneIndex));
}
}
// Add offspring to new population
newPopulation.setIndividual(populationIndex,
offspring);
} else {
// Add individual to new population without
applying crossover
newPopulation.setIndividual(populationIndex,
parent1);
}
}
return newPopulation;
}
In the first line of the crossoverPopulation( ) method, a new
empty population is created for the next generation. Next, the
population is looped over and the crossover rate is used to
consider each individual for crossover. (Theres also a
mysterious elitism term here, which well discuss in the next
section.) If the individual doesnt go through crossover, its
added straight to the next population, otherwise a new
individual is created. The offsprings chromosome is filled by
looping over the parent chromosomes and randomly adding
genes from each parent to the offsprings chromosome. When
this crossover process has finished for each individual of the
population, the crossover method returns the next generations
population.
318
From here we can implement the crossover function into
our main method in the AllOnesGA class. The entire
AllOnesGA class and main method is printed below; however
the only change from before is the addition of the line that calls
crossoverPopulation( ) below the Apply crossover comment.
package test;
public class AllOnesGA {
public static void main(String[] args) {
// Create GA object
GeneticAlgorithm ga = new GeneticAlgorithm(100,
0.001, 0.95, 0);
// Initialize population
Population population = ga.initPopulation(50);
// Evaluate population
ga.evalPopulation(population);
// Keep track of current generation
int generation = 1;
while (ga.isTerminationConditionMet(population) ==
false) {
// Print fittest individual from population
System.out.println("Best solution: " +
population.getFittest(0).toString());
// Apply crossover
population = ga.crossoverPopulation(population);
// Apply mutation
// TODO
319
// Evaluate population
ga.evalPopulation(population);
// Increment the current generation
generation++;
}
System.out.println("Found solution in " + generation + "
generations");
System.out.println("Best solution: " +
population.getFittest(0).toString());
}
}
The last thing we need to add to complete the evolution
process is mutation. Like crossover, there are many different
mutation methods to choose from. When using binary strings
one of the more common methods is called bit flip mutation. As
you may have guessed, bit flip mutation involves flipping the
value of the bit from 1 to 0, or from 0 to 1, depending on its
initial value. When the chromosome is encoded using some
other representation, a different mutation method would usually
be implemented to make better use of the encoding.
One of the most important factors in selecting mutation and
crossover methods is to make sure that the method youve
selected still yields a valid solution. Well see this concept in
action in later chapters, but for this problem we simply need to
make sure that 0 and 1 are the only possible values a gene can
mutate to. A gene mutating to, say, 7 would give us an invalid
solution.
320
This advice seems moot and over-obvious in this chapter,
but consider a different simple problem where you need to order
the numbers one through six without repeating (i.e., end up with
123456). A mutation algorithm that simply chose a random
number between one and six could yield 126456, using 6
twice, which would be an invalid solution because each number
can only be used once. As you can see, even simple problems
sometimes require sophisticated techniques.
Similarly to crossover, mutation is applied to the individual
based on the mutation rate. If the mutation rate is set to 0.1 then
each gene has a 10% chance of being mutated during the
mutation stage.
Lets go ahead and add the mutation function to our
GeneticAlgorithm class. We can add this anywhere:
public Population mutatePopulation(Population population) {
// Initialize new population
Population newPopulation = new
Population(this.populationSize);
// Loop over current population by fitness
for (int populationIndex = 0; populationIndex <
population.size(); populationIndex++) {
Individual individual
= population.getFittest(populationIndex);
// Loop over individuals genes
for (int geneIndex = 0; geneIndex <
individual.getChromosomeLength(); geneIndex++) {
// Skip mutation if this is an elite individual
if (populationIndex >= this.elitismCount) {
// Does this gene need mutation?
321
if (this.mutationRate > Math.random()) {
// Get new gene
int newGene = 1;
if (individual.getGene(geneIndex) == 1) {
newGene = 0;
}
// Mutate gene
individual.setGene(geneIndex, newGene);
}
}
}
// Add individual to population
newPopulation.setIndividual(populationIndex,
individual);
}
// Return mutated population
return newPopulation;
}
The mutatePopulation( ) method starts by creating a new
empty population for the mutated individuals and then begins to
loop over the current population. Each individuals
chromosome is then looped over and each gene is considered
for bit flip mutation using the mutation rate. When the entire
chromosome of an individual has been looped over, the
individual is then added to the new mutation population. When
all individuals have gone through the mutation process the
mutated population is returned.
Now we can complete the final step of the evolution loop
by adding the mutate function to the main method. The finished
main method is as follows. There are only two differences from
322
the last time: first, weve added the call to mutatePopulation( )
below the Apply mutation comment. Also, weve changed the
elitismCount parameter in the new GeneticAlgorithm
constructor from 0 to 2, now that we understand how elitism
works.
package test;
public class AllOnesGA {
public static void main(String[] args) {
// Create GA object
GeneticAlgorithm ga = new GeneticAlgorithm(100,
0.001, 0.95, 2);
// Initialize population
Population population = ga.initPopulation(50);
// Evaluate population
ga.evalPopulation(population);
// Keep track of current generation
int generation = 1;
while (ga.isTerminationConditionMet(population) ==
false) {
// Print fittest individual from population
System.out.println("Best solution: " +
population.getFittest(0).toString());
// Apply crossover
population = ga.crossoverPopulation(population);
// Apply mutation
323
population = ga.mutatePopulation(population);
// Evaluate population
ga.evalPopulation(population);
// Increment the current generation
generation++;
}
System.out.println("Found solution in " + generation + "
generations");
System.out.println("Best solution: " +
population.getFittest(0).toString());
}
}
324
CHAPTER 10 NATURAL LANGUAGE
PROCESSING
Getting Started
NLP is the ability of a computer program to understand human
speech as t is spoken. It finds its use in many problems like
supporting search engine, speech recognition, document
categorization, query analysis and to summarize and classify
web pages.
In a nutshell NLP processes natural text from various sources,
performs NLP tasks and retrieve the information.
Basic techniques used:
Tokenization
Sentence detection
Clssification
extraction
Tools to support NLP - Java API
Ex of APIs
Apaches Open NLP
Stanford NLP
LingPipe
Gate
325
In this book Apaches Open NLP will be used for the egs
NLP is extensively complex field and we would be able to
address a small part of it, focusing on tasks which can be
handled/implemented in java
What is NLP?
NLP is a set of tool to derive meaningful and useful information
for natural language source such as webpage and text
documents. The info should have some commercial value.
Modern search engine uses NLP technique to show results after
user query.
When we work with a language, the terms, syntax, and
semantics, are frequently
encountered. The syntax of a language refers to the rules that
control a valid sentence structure.
As we progress with our discussions, we will introduce many
linguistic terms that will helpus better understand natural
languages and provide us with a common vocabulary to explain
the various NLP techniques. We will see how the text can be
split into individual elements and how these elements can be
classified.
In general, these approaches are used to enhance applications,
thus making them more valuable to their users. The uses of NLP
can range from relatively simple uses to those that are pushing
what is possible today. In this book, we will show examples that
illustrate simple approaches, which may be all that is required
for some problems, to the more advanced libraries and classes
available to address sophisticated needs
326
Why NLP?
NLP is used in a wide variety of disciplines to solve many
different types of problems. Text analysis is performed on text
that ranges from a few words of user input for an Internet query
to multiple documents that need to be summarized. We have
seen a large growth in the amount and availability of
unstructured data in recent years. This has taken forms such as
blogs, tweets, and various other social media. NLP is ideal for
analyzing this type of information.
Machine learning and text analysis are used frequently to
enhance an application's utility. A brief list of application areas
follow:
1. Searching
2. Machine Translation
3. Summation
4. Named Entity Recognition
5. Information Grouping
6. Parts of Speech Tagging
7. Sentiment analysis
8. Answering queries
9. Speech Recognition
10. Natural Language Generation
327
Survey of NLP Tools
There are many tools available that support NLP. Some of
these are available with the Java SE SDK but are limited in their
utility for all but the simplest types of problems. Other
libraries such as Apache's OpenNLP and LingPipe provide
extensive and sophisticated support for NLP problems. Low-
level Java support includes string libraries, such as String,
StringBuilder, and StringBuffer. These classes possess methods
that perform searching, matching, and text replacement. Regular
expressions use special encoding to match substrings. Java
provides a rich set of techniques to use regular expressions. As
discussed earlier, tokenizers are used to split text into individual
elements. Java provides support for tokenizers with: The String
class' split method and The StreamTokenizer class.
Overview of text processing tasks
Although there are numerous NLP tasks that can be performed,
we will focus only on a subset of these tasks. A brief overview
of these tasks is presented here, which is also reflected in the
following chapters: Finding Parts of Text Finding Sentences
Finding People and Things Detecting Parts of Speech
Classifying Text and Documents Extracting Relationships
Combined Approaches.
Many of these tasks are used together with other tasks to
achieve some objective. We will see this as we progress through
the book. For example, tokenization is frequently used as an
initial step in many of the other tasks. It is a fundamental and
basic step. Finding parts of text Text can be decomposed into a
number of different types of elements such as words, sentences,
and paragraphs. There are several ways of classifying these
elements. When we refer to parts of text in this book, we are
328
referring to words, sometimes called tokens. Morphology is the
study of the structure of words. We will use a number of
morphology terms in our exploration of NLP. However, there
are many ways of classifying words including the following:
Simple words: These are the common connotations of what a
word means including the 17 words of this sentence.
Morphemes: These are the smallest units of a word that is
meaningful. For example, in the word "bounded", "bound" is
considered to be a morpheme. Morphemes also include parts
such as the suffix, "ed".
Prefix/Suffix: This precedes or follows the root of a word. For
example, in the word graduation, the "ation" is a suffix based on
the word "graduate".
Synonyms: This is a word that has the same meaning as
another word. Words such as small and tiny can be recognized
as synonyms. Addressing this issue requires word sense
disambiguation.
Abbreviations: These shorten the use of a word. Instead of
using Mister Smith, we use Mr. Smith.
Acronyms: These are used extensively in many fields including
computer science. They use a combination of letters for phrases
such as FORmula TRANslation for FORTRAN. They can be
recursive such as GNU. Of course, the one we will continue to
use is NLP.
Contractions: We'll find these useful for commonly used
combinations of words such as the first word of this sentence.
Numbers: A specialized word that normally uses only digits.
However, more complex versions can include a period and a
329
special character to reflect scientific notation or numbers of a
specific base.
Finding Sentences
Partitioning text into sentences is also called Sentence
Boundary Disambiguation (SBD). This process is useful for
many downstream NLP tasks that require analysis within
sentences; for example, POS and phrase analysis typically work
within a sentence.
The SBD process
The SBD process is language dependent and is often not
straightforward. Common approaches to detect sentences
include using a set of rules or training a model to detect them. A
set of simple rules for detecting a sentence follows. The end of
a sentence is detected if: The text is terminated by a period,
question mark, or exclamation mark The period is not preceded
by an abbreviation or followed by a digit Although this works
well for most sentences, it will not work for all of them. For
example, it is not always easy to determine what an
abbreviation is, and sequences such as ellipses may be confused
with periods. Most search engines are not concerned with SBD.
They are only interested in a query's tokens and their positions.
POS taggers and other NLP tasks that perform extraction of
data will frequently process individual sentences. The detection
of sentence boundaries will help separate phrases that might
appear to span sentences.
What makes SBD difficult?
Breaking text into sentences is difficult for a number of reasons:
Punctuation is frequently ambiguous Abbreviations often
contain periods Sentences may be embedded within each other
330
by the use of quotes With more specialized text, such as tweets
and chat sessions, we may need to consider the use of new lines
or completion of clauses Punctuation ambiguity is best
illustrated by the period. It is frequently used to demark the end
of a sentence. However, it can be used in a number of other
contexts as well, including abbreviation, numbers, e-mail
addresses, and ellipses. Other punctuation characters, such as
question and exclamation marks, are also used in embedded
quotes and specialized text such as code that may be in a
document. Periods are used in a number of situations: To
terminate a sentence To end an abbreviation To end an
abbreviation and terminate a sentence For ellipses For ellipses
at the end of a sentence Embedded in quotes or brackets Most
sentences we encounter end with a period. This makes them
easy to identify. However, when they end with an abbreviation,
it a bit more difficult to identify them.
Simple Java SBDs
Sometimes, text may be simple enough that Java core support
will suffice. There are two approaches that will perform SBD:
using regular expressions and using the BreakIterator class. We
will examine both approaches here. Using regular expressions
Regular expressions can be difficult to understand. While
simple expressions are not usually a problem, as they become
more complex, their readability worsens. This is one of the
limitations of regular expressions when trying to use them for
SBD. We will present two different regular expressions. The
first expression is simple, but does not do a very good job. It
illustrates a solution that may be too simple for some problem
domains. The second is more sophisticated and does a better
job. In this example, we create a regular expression class that
matches periods, question marks, and exclamation marks. The
331
String class' split method is used to split the text into
sentences.Sample snippet is provided below.
String simple = "[.?!]";
String[] splitString = (paragraph.split(simple));
for (String string : splitString)
{ System.out.println(string); }
When determining the end of sentences we need to consider
several factors Sentences may end with exclamation marks Or
possibly questions marks Within sentences we may find
numbers like 3 14159, abbreviations such as found in Mr Smith,
and possibly ellipses either within a sentence , or at the end
of a sentence
As expected, the method splits the paragraph into characters
regardless of whether they are part of a number or abbreviation.
A second approach follows, which produces better results. This
example has been adapted from an example found at
http://stackoverflow.com/questions/5553410/regular-
expressionmatch-a-sentence. The Pattern class, which compiles
the following regular expression, is used:
[^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
The comment in the following code sequence provides an
explanation of what each part represents:
Pattern sentencePattern = Pattern.compile(
"# Match a sentence ending in punctuation or EOS.\n" +
"[^.!?\\s] # First char is non-punct, non-ws\n"
332
+ "[^.!?]* # Greedily consume up to punctuation.\n" + "(?: #
Group for unrolling the loop.\n"
+ " [.!?] # (special) inner punctuation ok if\n"
+ " (?!['\"]?\\s|$) # not followed by ws or EOS.\n" + " [^.!?]*
# Greedily consume up to punctuation.\n" + ")* # Zero or
more (special normal*)\n"
+ "[.!?]? # Optional ending punctuation.\n"
+ "['\"]? # Optional closing quote.\n"
+ "(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Finding People and Things
The process of finding people and things is referred to as
Named Entity Recognition (NER). Entities such as people and
places are associated with categories that have names, which
identify what they are. A named category can be as simple as
"people". Common entity types include: People Locations
Organizations Money Time URLs Finding names, locations,
and various things in a document are important and useful NLP
tasks. They are used in many places such as conducting simple
searches, processing queries, resolving references, the
disambiguation of text, and finding the meaning of text. For
example, NER is sometimes interested in only finding those
entities that belong to a single category. Using categories, the
search can be isolated to those item types. Other NLP tasks use
NER such as in POS taggers and in performing cross-
referencing tasks.
333
Techniques for name recognition
There are a number of NER techniques available. Some use
regular expressions and others are based on a predefined
dictionary. Regular expressions have a lot of expressive power
and can isolate entities. A dictionary of entity names can be
compared to tokens of text to find matches. Another common
NER approach uses trained models to detect their presence.
These models are dependent on the type of entity we are
looking for and the target language. A model that works well
for one domain, such as web pages, may not work well for a
different domain, such as medical journals.
When a model is trained, it uses an annotated block of text,
which identifies the entities of interest. To measure how well a
model has been trained, several measures are used:
Precision: It is the percentage of entities found that match
exactly the spans found in the evaluation data
Recall: It is the percentage of entities defined in the corpus that
were found in the same location
Performance measure:
It is the harmonic mean of precision and recall given by
F1 = 2 *Precision* Recall / (Recall + Precision)
We will use these measures when we cover the evaluation of
models.
NER is also known as entity identification and entity chunking.
Chunking is the analysis of text to identify its parts such as
nouns, verbs, or other components. As humans, we tend to
chunk a sentence into distinct parts. These parts form a structure
334
that we use to determine its meaning. The NER process will
create spans of text such as "Queen of England". However,
there may be other entities within these spans such as
"England".
Lists and regular expressions
One technique is to use lists of "standard" entities along with
regular expressions to identify the named entities. Named
entities are sometimes referred to as proper nouns. The standard
entities list could be a list of states, common names, months, or
frequently referenced locations. Gazetteers, which are lists that
contain geographical information used with maps, provide a
source of location-related entities. However, maintaining such
lists can be time consuming. They can also be specific to
language and locale. Making changes to the list can be tedious.
We will demonstrate this approach in the Using the
ExactDictionaryChunker class section later in this chapter.
Regular expressions can be useful in identifying entities. Their
powerful syntax provides enough flexibility in many situations
to accurately isolate the entity of interest. However, this
flexibility can also make it difficult to understand and maintain.
We will demonstrate several regular expression approaches in
this article.
Using regular expressions for NER
Regular expressions can be used to identify entities in a
document. We will investigate two general approaches: The
first one uses regular expressions as supported by Java. These
can be useful in situations where the entities are relatively
simple and consistent in their form. The second approach uses
classes designed to specifically use regular expressions. To
demonstrate this, we will use LingPipe's RegExChunker class.
335
When working with regular expressions, it is advantageous to
avoid reinventing the wheel. There are many sources for
predefined and tested expressions. One such library can be
found at http://regexlib.com/Default.aspx. We will use several
of the regular expressions in this library for our examples. To
test how well these approaches work, we will use the following
text for most of our examples:
private static String regularExpressionText = "He left his
email address (rgb@colorworks.com) and his " + "phone
number,800-555-1234. We believe his current address " + "is
100 Washington Place, Seattle, CO 12345-1234. I " +
"understand you can also call at 123-555-1234 between " +
"8:00 AM and 4:30 most days. His URL is http://example.com "
+ "and he was born on February 25, 1954 or 2/25/1954.";
Using Java's regular expressions to find entities To demonstrate
how these expressions can be used, we will start with several
simple examples. The initial example starts with the following
declaration. It is a simple expression designed to identify certain
types of phone numbers:
String phoneNumberRE = "\\d{3}-\\d{3}-\\d{4}";
We will use the following code to test our simple expressions.
The Pattern class' compile method takes a regular expression
and compiles it into a Pattern object. Its matcher method can
then be executed against the target text, which returns a
Matcher object. This object allows us to repeatedly identify
regular expression matches:
Pattern pattern = Pattern.compile(phoneNumberRE);
Matcher matcher = pattern.matcher(regularExpressionText);
336
while (matcher.find())
{ System.out.println(matcher.group() + " [" + matcher.start()
+ ":" + matcher.end() + "]"); }
The find method will return true when a match occurs. Its group
method returns the text that matches the expression. Its start and
end methods give us the position of the matched
text in the target text. When executed, we will get the following
output:
800-555-1234 [68:80] 123-555-1234 [196:208]
A number of other regular expressions can be used in a similar
manner. These are listed in the following table. The third
column is the output produced when the corresponding regular
expression is used in the previous code sequence:
Entity type Regular expression Output
URL \\b(https?|ftp|file|ldap)://[-A-Za-z0-
9+&@#/%?=~_|!:,.;]*[A-Za-z0-9+&@#/%=~_|]
http://example.com [256:274]
ZIP code [0-9]{5}(\\-?[0-9]{4})? 12345-1234 [150:160]
E-mail [a-zA-Z0-9'._%+-]+@(?:[a-zA-Z0-9-]+\\.)+[a-zA-
Z]{2,4} rgb@colorworks.com [27:45]
Time (([0-1]?[0-9])|([2][0-3])):([0-5]?[0-9])(:([0-5]?[0-9]))?
8:00 [217:221]
4:30 [229:233]
337
Date
((0?[13578]|10|12)(-|\\/)(([1-9])|(0[1-9])|([12])([0-9]?)|
(3[01]?))(-|\\/)((19)([2-9])(\\d{1})|(20)([01])(\\d{1})|
([8901])(\\d{1}))|(0?[2469]|11)(-|\\/)(([1-9])|(0[1-9])| ([12])([0-
9]?)|(3[0]?))(-|\\/)((19)([2-9])(\\d{1})|(20)
([01])(\\d{1})|([8901])(\\d{1})))
2/25/1954 [315:324]
There are many other regular expressions that we could have
used. However, these examples illustrate the basic technique.
As demonstrated with the date regular expression, some of these
can be quite complex. It is common for regular expressions to
miss some entities and to falsely report other nonentities as
entities. For example, if we replace the text with the following
expression:
regularExpressionText =
"(888)555-1111 888-SEL-HIGH 888-555-2222-J88-
W3S";
Executing the code will return this:
888-555-2222 [27:39]
It missed the first two phone numbers and falsely reported
the "part number" as a phone number. We can also search for
more than one regular expression at a time using the | operator.
In the following statement, three regular expressions are
combined using this operator. They are declared using the
corresponding entries in the previous table:
Pattern pattern = Pattern.compile(phoneNumberRE + "|"
+ timeRE + "|" + emailRegEx);
338
When executed using the original regularExpressionText
text defined at the beginning of the previous section, we get the
following output:
rgb@colorworks.com [27:45] 800-555-1234 [68:80] 123-
555-1234 [196:208] 8:00 [217:221] 4:30 [229:233]
339
CHAPTER 11 PERFORMANCE ENHANCEMENTS
IN JAVA
Getting started
This chapter is an introduction to the Garbage First (or G1)
garbage collector (GC) along with a historical perspective on
the garbage collectors in the Java HotSpot Virtual Machine
(VM), hereafter called just HotSpot, and the reasoning behind
G1s inclusion in HotSpot.
The term parallel means a multithreaded garbage collection
operation. When a GC event activity is described as parallel,
multiple threads are used to perform it. When a
garbage collector is described as parallel, it uses multiple
threads to perform garbage collection. In the case of the
HotSpot garbage collectors, almost all multithreaded GC
operations are handled by internal Java VM (JVM) threads. One
major exception to this is the G1 garbage collector, in which
some background GC work can be taken on by the application
threads
Garbage First (G1) GC
The G1 garbage collector addresses many of the
shortcomings of Parallel, Serial, and CMS GC by taking a
somewhat different approach. G1 divides the heap into a set of
regions. Most GC operations can then be performed a region at
a time rather than on the entire Java heap or an entire
generation.
In G1, the young generation is just a set of regions, which
means that it is not required to be a consecutive chunk of
memory. Similarly, the old generation is also just a set of
regions. There is no need to decide at JVM launch time which
340
regions should be part of the old or young generation. In fact,
the normal operational state for G1 is that over time the virtual
memory mapped to G1 regions moves back and forth between
the generations. A G1 region may be designated as young and
later, after a young generation collection, become available for
use elsewhere, since young generation regions are completely
evacuated to unused regions.
In the remainder of this chapter, the term available region is
used to identify regions that are unused and available for use by
G1. An available region can be used or designated as a young or
old generation region. It is possible that after a young
generation collection, a young generation region can at some
future time be used as an old generation region. Likewise, after
collection of an old generation region, it becomes an available
region that can at some future time be used as a young
generation region.
G1 young collections are parallel stop-the-world collections.
As mentioned earlier, parallel stop-the-world collections pause
all Java application threads while the garbage collector threads
execute, and the GC work is spread across multiple threads. As
with the other HotSpot garbage collectors, when a young
generation collection occurs, the entire young generation is
collected.
Old generation G1 collections are quite different from those
of the other HotSpot collectors. G1 old generation collections
do not require the entire old generation to be collected in order
to free space in the old generation. Instead, only a subset of the
old generation regions may be collected at any one time. In
addition, this subset of old generation regions is collected in
conjunction with a young collection.
341
Similar to CMS GC, there is a fail-safe to collect and
compact the entire old generation in dire situations such as
when old generation space is exhausted.
A G1 old generation collection, ignoring the fail-safe type of
collection, is a set of phases, some of which are parallel stop-
the-world and some of which are parallel concurrent. That is,
some phases are multithreaded and stop all application threads,
and others are multithreaded and execute at the same time as the
application threads. Chapters 2 and 3 provide more detail on
each of these phases.
G1 initiates an old generation collection when a Java heap
occupancy threshold is exceeded. It is important to note that the
heap occupancy threshold in G1 measures the old generation
occupancy compared to the entire Java heap. Readers who are
familiar with CMS GC remember that CMS initiates an old
generation collection using an occupancy threshold applied
against the old generation space only. In G1, once the heap
occupancy threshold is reached or exceeded, a parallel stop-the-
world initial-mark phase is scheduled to execute.
The initial-mark phase executes at the same time as the next
young GC. Once the initial-mark phase completes, a concurrent
multithreaded marking phase is initiated to mark all live objects
in the old generation. When the concurrent marking phase is
completed, a parallel stop-the-world remark phase is scheduled
to mark any objects that may have been missed due to
application threads executing concurrently with the marking
phase. At the end of the remark phase, G1 has full marking
information on the old generation regions. If there happen to be
old generation regions that do not have any live objects in them,
they can be reclaimed without any additional GC work during
the next phase of the concurrent cycle, the cleanup phase.
342
Also at the end of the remark phase, G1 can identify an
optimal set of old generations to collect.
The regions selected for inclusion in a CSet are based on how
much space can be freed and the G1 pause time target. After the
CSet has been identified, G1 schedules a GC to collect regions
in the CSet during the next several young generation GCs. That
is, over the next several young GCs, a portion of the old
generation will be collected in addition to the young generation.
This is the mixed GC type of garbage collection event
mentioned earlier.
With G1, every region that is garbage collected, regardless
of whether it is young or old generation, has its live objects
evacuated to an available region. Once the live objects have
been evacuated, the young and/or old regions that have been
collected become available regions.
An attractive outcome of evacuating live objects from old
generation regions into available regions is that the evacuated
objects end up next to each other in the virtual address space.
There is no fragmented empty space between objects.
Effectively, G1 does partial compaction of the old generation.
Remember that CMS, Parallel, and Serial GC all require a full
GC to compact the old generation and that compaction scans the
entire old generation.
Since G1 performs GC operations on a per-region basis, it is
suitable for large Java heaps. The amount of GC work can be
limited to a small set of regions even though the Java heap size
may be rather large.
The largest contributors to pause times in G1 are young and
mixed collections, so one of the design goals of G1 is to allow
the user to set a GC pause time goal. G1 attempts to meet the
343
specified pause time goal through adaptive sizing of the Java
heap. It will automatically adjust the size of the young
generation and the total Java heap size based on the pause time
goal. The lower the pause time goal, the smaller the young
generation and the larger the total heap size, making the old
generation relatively large.
A G1 design goal is to limit required tuning to setting a
maximum Java heap size and specifying a GC pause time
target. Otherwise, G1 is designed to dynamically tune itself
using internal heuristics. At the time of writing, the heuristics
within G1 are where most active HotSpot GC development is
taking place. Also as of this writing, G1 may require additional
tuning in some cases, but the prerequisites to building good
heuristics are present and look promising. For advice on how to
tune G1, see Chapter 3.
To summarize, G1 scales better than the other garbage
collectors for large Java heaps by splitting the Java heap into
regions. G1 deals with Java heap fragmentation with the help of
partial compactions, and it does almost all its work in a
multithreaded fashion.
As of this writing, G1 primarily targets the use case of large
Java heaps with reasonably low pauses, and also those
applications that are using CMS GC. There are plans to use G1
to also target the throughput use case, but for applications
looking for high throughput that can tolerate longer GC pauses,
Parallel GC is currently the better choice.
As mentioned earlier, G1 divides the Java heap into regions.
The region size can vary depending on the size of the heap but
must be a power of 2 and at least 1MB and at most 32MB.
Possible region sizes are therefore 1, 2, 4, 8, 16, and 32MB. All
344
regions are the same size, and their size does not change during
execution of the JVM. The region size calculation is based on
the average of the initial and maximum Java heap sizes such
that there are about 2000 regions for that average heap size. As
an example, for a 16GB Java heap with -Xmx16g -
Xms16g command-line options, G1 will choose a region size of
16GB/2000 = 8MB.
If the initial and maximum Java heap sizes are far apart or if
the heap size is very large, it is possible to have many more than
2000 regions. Similarly, a small heap size may end up with
many fewer than 2000 regions.
Each region has an associated remembered set (a collection
of the locations that contain pointers into the region, shortened
to RSet). The total RSet size is limited but noticeable, so the
number of regions has a direct effect on HotSpots memory
footprint. The total size of the RSets heavily depends on
application behavior. At the low end, RSet overhead is around 1
percent and at the high end 20 percent of the heap size.
A particular region is used for only one purpose at a time,
but when the region is included in a collection, it will be
completely evacuated and released as an available region.
There are several types of regions in G1. Available regions
are currently unused. Eden regions constitute the young
generation eden space, and survivor regions constitute the
young generation survivor space. The set of all eden and
survivor regions together is the young generation. The number
of eden or survivor regions can change from one GC to the
next, between young, mixed, or full GCs. Old generation
regions comprise most of the old generation. Finally,
humongous regions are considered to be part of the old
345
generation and contain objects whose size is 50 percent or more
of a region. Until a JDK 8u40 change, humongous regions were
collected as part of the old generation, but in JDK 8u40 certain
humongous regions are collected as part of a young collection.
There is more detail on humongous regions later in this chapter.
The fact that a region can be used for any purpose means
that there is no need to partition the heap into contiguous young
and old generation segments. Instead, G1 heuristics estimate
how many regions the young generation can consist of and still
be collected within a given GC pause time target. As the
application starts allocating objects, G1 chooses an available
region, designates it as an eden region, and starts handing out
memory chunks from it to Java threads. Once the region is full,
another unused region is designated an eden region. The process
continues until the maximum number of eden regions is
reached, at which point a young GC is initiated.
During a young GC, all young regions, eden and survivor,
are collected. All live objects in those regions are evacuated to
either a new survivor region or to an old generation region.
Available regions are tagged as survivor or old generation
regions as needed when the current evacuation target region
becomes full.
When the occupancy of the old generation space, after a GC,
reaches or exceeds the initiating heap occupancy threshold, G1
initiates an old generation collection. The occupancy threshold
is controlled by the command-line option -
XX:InitiatingHeapOccupancyPercent, which defaults to 45
percent of the Java heap.
G1 can reclaim old generation regions early when the
marking phase shows that they contain no live objects. Such
346
regions are added to the available region set. Old regions
containing live objects are scheduled to be included in a future
mixed collection.
G1 uses multiple concurrent marking threads. In an attempt
to avoid stealing too much CPU from application threads,
marking threads do their work in bursts. They do as much work
as they can fit into a given time slot and then pause for a while,
allowing the Java threads to execute instead.
Humongous Objects
G1 deals specially with large object allocations, or what G1
calls humongous objects. As mentioned earlier, a humongous
object is an object that is 50 percent or more of a region size.
That size includes the Java object header. Object header sizes
vary between 32- and 64-bit HotSpot VMs. The header size for
a given object within a given HotSpot VM can be obtained
using the Java Object Layout tool, also known as JOL. As of
this writing, the Java Object Layout tool can be found on the
Internet [2].
When a humongous object allocation occurs, G1 locates a
set of consecutive available regions that together add up to
enough memory to contain the humongous object. The first
region is tagged as a humongous start region and the other
regions are marked as humongous continues regions. If there
are not enough consecutive available regions, G1 will do a full
GC to compact the Java heap.
Humongous regions are considered part of the old
generation, but they contain only one object. This property
allows G1 to eagerly collect a humongous region when the
concurrent marking phase detects that it is no longer live. When
347
this happens, all the regions containing the humongous object
can be reclaimed at once.
A potential challenge for G1 is that short-lived humongous
objects may not be reclaimed until well past the point at which
they become unreferenced. JDK 8u40 implemented a method
to, in some cases, reclaim a humongous region during a young
collection. Avoiding frequent humongous object allocations can
be crucial to achieving application performance goals when
using G1. The enhancements available in JDK 8u40 help but
may not be a solution for all applications having many short-
lived humongous objects.
Full Garbage Collections
Full GCs in G1 are implemented using the same algorithm as
the Serial GC collector. When a full GC occurs, a full
compaction of the entire Java heap is performed. This ensures
that the maximum amount of free memory is available to the
system. It is important to note that full GCs in G1 are single-
threaded and as a result may introduce exceptionally long pause
times. Also, G1 is designed such that full GCs are not expected
to be necessary. G1 is expected to satisfy application
performance goals without requiring a full GC and can usually
be tuned such that a full GC is not needed.
Concurrent Cycle
A G1 concurrent cycle includes the activity of several phases:
initial marking, concurrent root region scanning, concurrent
marking, remarking, and cleanup. The beginning of a
concurrent cycle is the initial mark, and the ending phase is
cleanup. All these phases are considered part of marking the
live object graph with the exception of the cleanup phase.
348
The purpose of the initial-mark phase is to gather all GC
roots. Roots are the starting points of the object graphs. To
collect root references from application threads, the application
threads must be stopped; thus the initial-mark phase is stop-the-
world. In G1, the initial marking is done as part of a young GC
pause since a young GC must gather all roots anyway.
The marking operation must also scan and follow all
references from objects in the survivor regions. This is what the
concurrent root region scanning phase does. During this phase
all Java threads are allowed to execute, so no application pauses
occur. The only limitation is that the scanning must be
completed before the next GC is allowed to start. The reason for
that is that a new GC will generate a new set of survivor objects
that are different from the initial marks survivor objects.
Most marking work is done during the concurrent marking
phase. Multiple threads cooperate to mark the live object graph.
All Java threads are allowed to execute at the same time as the
concurrent marking threads, so there is no pause in the
application, though an application may experience some
throughput reduction.
After concurrent marking is done, another stop-the-world
phase is needed to finalize all marking work. This phase is
called the remark phase and is usually a very short stop-the-
world pause.
The final phase of concurrent marking is the cleanup phase.
In this phase, regions that were found not to contain any live
objects are reclaimed. These regions are not included in a young
or mixed GC since they contain no live objects. They are added
to the list of available regions.
349
The marking phases must be completed in order to find out
what objects are live so as to make informed decisions about
what regions to include in the mixed GCs. Since it is the mixed
GCs that are the primary mechanism for freeing up memory in
G1, it is important that the marking phase finishes before G1
runs out of available regions. If the marking phase does not
finish prior to running out of available regions, G1 will fall back
to a full GC to free up memory. This is reliable but slow.
Heap Sizing
The Java heap size in G1 is always a multiple of the region size.
Except for that limitation, G1 can grow and shrink the heap size
dynamically between -Xms and -Xmx just as the other HotSpot
GCs do.
G1 may increase the Java heap size for several reasons:
1. An increase in size can occur based on heap size calculations
during a full GC.
2. When a young or mixed GC occurs, G1 calculates the time
spent to perform the GC compared to the time spent
executing the Java application. If too much time is spent in
GC according to the command-line setting -
XX:GCTimeRatio, the Java heap size is increased. The idea
behind growing the Java heap size in this situation is to
allow GCs to happen less frequently so that the time spent in
GC compared to the time spent executing the application is
reduced.
The default value for -XX:GCTimeRatio in G1 is 9. All
other HotSpot garbage collectors default to a value of 99. The
larger the value for GCTimeRatio, the more aggressive the
increase in Java heap size. The other HotSpot collectors are thus
350
more aggressive in their decision to increase Java heap size and
by default are targeted to spend less time in GC relative to the
time spent executing the application.
3. If an object allocation fails, even after having done a GC,
rather than immediately falling back to doing a full GC, G1
will attempt to increase the heap size to satisfy the object
allocation.
4. If a humongous object allocation fails to find enough
consecutive free regions to allocate the object, G1 will try to
expand the Java heap to obtain more available regions rather
than doing a full GC.
5. When a GC requests a new region into which to evacuate
objects, G1 will prefer to increase the size of the Java heap
to obtain a new region rather than failing the GC and falling
back to a full GC in an attempt to find an available region.
351