Emerging trends in DW
Testing
Big data
Mobile BI
Big Data
3 Vs
Volume , Velocity and Variety
Silent 4th V
Value
Hadoop Big data
framework
Hadoop is an opensource framework for handling big data
processing.(inspired by a paper published by Google on GFS
and mapreduce)
2 main components areHDFS- Hadoop distributed file system
Mapreduce Programming framework for aggregation,
counts written in Java.
Other tools supported by HadoopApache Pig Higher level abstraction language for
processing data.
Apache Hive SQL like language to process data stored in
Hadoop.
Not fully SQL compilant.
HDFS key points
HDFS was designed to be a scalable, fault-tolerant,
distributed storage system that works closely with
MapReduce.
By distributing storage and computation across many
servers, the combined storage resource can grow with
demand while remaining economical at every size.
Files are split into blocks (single unit of
storage)
Managed by Namenode, stored by Datanode
Transparent to user
Replicated across machines at load time
Same block is stored on multiple machines
Good for fault-tolerance and access
Default replication is 3
HDFS CLI
Mkdir It will take path uris as argument and creates directory or
directories.
hadoop fs -mkdir <paths>
Eg- hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2 hadoop fs -mkdir
hdfs://nn1.example.com/user/hadoop/dir
Ls - Lists the contents of a directory
hadoop fs -ls <args>
Eg- hadoop fs -ls /user/hadoop/dir1
Put -Copy single src file, or multiple src files from local file system to the
Hadoop data file system
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Get - Copies/Downloads files to the local file system
hadoop fs -get <hdfs_src> <localdst>
Sample Map reduce
Sample map reduce
public static class M ap extends M apper< LongW ritable, Text, Text,
IntW ritable> {
private fi
n alstatic IntW ritable one = new IntW ritable(1);
private Text w ord = new Text();
public void m ap(LongW ritable key, Text value, Context context)
throw s IO Exception, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
w hile (tokenizer.hasM oreTokens()) {
w ord.set(tokenizer.nextToken());
context.w rite(w ord, one);
}
}
}
public static class Reduce extends Reducer<Text,
IntWritable, Text, IntWritable> {
public void reduce(Text key,
Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
Sample Apache Pig query
inpt = load '/path/to/corpus using TextLoader as (line:
chararray);
words = foreach inpt generate flatten(TOKENIZE(line)) as
word;
grpd = group words by word;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;
The above pig script, first splits each line into words using
theTOKENIZEoperator. The tokenize function creates a bag of
words. Using theFLATTENfunction, the bag is converted into a
tuple. In the third statement, the words are grouped together so
that the count can be computed
Sample Hive query
SELECT [ALL | DISTINCT] select_expr,
select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list] [HAVING
having_condition] [CLUSTER BY
col_list | [DISTRIBUTE BY col_list]
[SORT BY col_list]] [LIMIT number]
Other Big data technologies
No SQL Characteristics
1. No particular schema
2. Highly scalable
3. Availability
4. Speed of access
Monogo DB(Document oriented), Cassandra, Hbase (Hadoop based)
All supports Java Programming API. 2 of the above built in Java.
All have its own query language CQL, Hbase query language
This technology deserves a entire presentation itself
Reference architecture
Different layers
1. Source to Hadoop data ingestion
(EL)
2. Hadoop processing using
mapreduce, Pig hive (T)
3. Loading of processed data from
hadoop to EDW.
4. BI reports and analytics using the
processed data stored in hadoop via
Hive
Validation of pre Hadoop processing-
Data from various sources like weblog, social media,
transactional data etc are loaded into hadoop.
Methodology Compare the input data file against source system
data to ensure data is extracted correctly.
Validating the files are loaded into HDFS correctly
Comparison using script to read data from HDFS(Java
API) and check against the respective source data.
Sample testing. But identifying the sample can be
challenge since volume is huge. Consider critical
business scenarios.
Code for accessing HDFS file system
public static void main(String[] args) throws IOException {
FileSystem hdfs =FileSystem.get(new Configuration());
Path homeDir=hdfs.getHomeDirectory();
//Print the home directory
System.out.println("Home folder -" +homeDir);
}
Code for copying File from local file system to HDFS
Path localFilePath = new Path("c://localdata/datafile1.txt");
Path hdfsFilePath=new (hdfs://localhost:9000/test/file1.txt");
hdfs.copyFromLocalFile(localFilePath, hdfsFilePath);
Copying File from HDFS to local file system
localFilePath=new Path("c://hdfsdata/datafile1.txt");
hdfs.copyToLocalFile(hdfsFilePath, localFilePath);
Reading data From HDFS File
BufferedReader bfr=new BufferedReader(new
InputStreamReader(hdfs.open(newFilePath)));
String str = null;
while ((str = bfr.readLine())!= null)
{
System.out.println(str);
}
Validation of Hadoop
processing
Validating the business logic on standalone mode.
Validating the mapreduce process to verify the key value
pairs is generated correctly.
Validating the output data against source files to ensure
data processing is done correctly.
Creating MRUnit ( testing framework) for mapreduce. Low
level unit testing.
Alternate approach - Write Pig scripts to implement the
business logic and verify the results generated from
mapreduce.
Validating of data extract and load
into EDW (Sqoop)
Validating that the transformation rules
are applied correctly
Validating that there is no data corruption
by comparing target table data against
HDFS file data. (Hive can be used )
Validating the aggregation of data and
data integrity.
Validation of reports
(Hive/EDW)
Validating by firing queries against
HDFS using HIVE
Validating by running SQL against
the EDW
Normal report testing approach.
Challanges
Volume- Prepare compare script to validate the data.
To reduce the time we can run all the comparison
script in parallel just like data is processed in
mapreduce.
https://community.informatica.com/solutions/file_tabl
e_compare_utility_hdfs_hive
Sample data ensuring maximum scenarios is covered.
Variety- Unstructured data can be converted to
structured form and then compared.
Non Functional testing
Performance testing
Failover testing
Realtime support
Summary
1. Data ingestion into hadoop via Sqoop,Flume,Kafka
2.Data processing within Hadoop using
mapreduce,Pig , Hive. (or ETL tools like Informatica
Big data edition,Talend)
3.Reporting using reporting tools like Tableau,
Microstrategy.(via HIVE)
4.Loading of data from Hadoop to EDW
(Teradata/Oracle) or analytical database
(GreenPlum/Netezzea)
Why Hadoop is gaining
popularity
1. Open source
2. Can run on commodity servers
3. Can support any type of
unstructured data.(which makes it
win over parallel database)
4.Fault tolerant
Use cases
1. ETL processing moved to Hadoop to take
advantage of processing of structured/unstructured
data
2. Machine learning over Hadoop. Eg
Recommendation engine (Amazon,Flipkart)
3.Fraud detection in credit card or insurance industry
4. Retail Understanding customers buying pattern,
Market basket analysis etc.
Emerging trends in Big data
1. Yarn A generic framework over Hadoop which
will allow other applications(other than
mapreduce) to be built over Hadoop and also
allow integration of other applications with
Hadoop
2. Stream Processing using Strom.(again
opensource)
3. In memory cluster computing using Apache
Spark over Hadoop clusters.
Data Mining in Hadoop Market
Basket Analysis
Market Basket Analysis is a modelling
technique based upon the theory
that if you buy a certain group of
items, you are more (or less) likely to
buy another group of items.
Collect list of pair of transactions
items most frequently occurring
together at a store
Initial data
Transaction 1: cracker, icecream, soda
Transaction 2: chicken, pizza, coke, bread
Transaction 3: baguette, soda, hering, cracker,
soda
Transaction 4: bourbon, coke, turkey,bread
Transaction 5: sardines, soda, chicken, coke
Transaction 6: apples, peppers, avocado, steak
Transaction 7: sardines, apples, peppers, avocado,
steak
Data setup
< (cracker, icecream), (cracker, soda),(icecream
and soda) >
< (chicken, pizza), (chicken, coke), (chicken,
bread),(pizza,coke) .. >
< (baguette, soda), (baguette, hering), (baguette,
cracker), (baguette, soda) >
< (bourbon, coke), (bourbon, turkey) >
< (sardines, soda), (sardines, chicken), (sardines,
coke)
>
Map phase
< ((cracker, icecream),1), ((cracker,
soda),1),((icecream and soda),1)>
<((chicken, pizza),1), ((chicken, coke),1),
((chicken, bread),1),((cracker,soda),1)>
Output
((cracker,icecream),1)
((cracker,soda),1) ((cracker,soda),1)
((key) , value)
Reduce phase
Input ((cracker,icecream),<1,1,1.>)
((cracker,soda) ,<1,1>)
Output ((cracker,icecream),540)
((cracker,soda), 240)
Result1.Icecream should be placed nearby to cracker.
2.Keep some combo offers for the combination to increase
sale.
How to test the above code
Create a sample dataset covering
different types of items combination
and take count. (using excel )
Run the mapreduce jobflow for MBA
Verify the results from step1 and
step2
What next
1. Learn Java and mapreduce (not
mandatory but will definitely help)
2. Learn Hadoop .(Hadoop
Definitive Guide by Tom white,
Hadoop in Action Chuck lam)
4.Learn some stuff related to Pig and
Hive.
5. Plenty of tutorials in net
3. Install a free VM
(Cloudera/Hortonworks)
Thank you
Questions