This document provides detailed instructions for processing nested JSON data using Apache Spark, specifically focusing on analyzing a public baby names dataset. It outlines the steps to read data from a URL, create a DataFrame, extract required fields, and analyze the data through queries and visualization. The document also highlights the integration services offered by Aegis Software Canada and includes company contact information.
Introduction to processing nested JSON data using Apache Spark, focusing on reading, extracting, analyzing, and visualizing data from a Baby names public dataset.
Demonstrating the structure of the JSON dataset using Spark's printSchema method and discussing the metadata it contains.
Identifying the fields in the JSON data we will analyze and explaining how to extract these fields using a temporary view and the explode function.
Performing queries to find popular baby name trends and visualizing the results using Databricks graphing tools.
Introduction to Aegis Software Canada, highlighting their expertise in Apache Spark integration and providing contact information.
Let us read a public JSON dataset available on the internet, extract the required
fields from the nested data, and analyze the dataset to get some insights. For this
demo, I’m using the Baby names public dataset.
What are we performing in this demo?
╺ Read data from the URL using the Scala API
╺ Convert the read data into a dataframe
╺ Extract the required fields from the nested JSON dataset
╺ Analyze the data by writing queries
╺ Visualize the processed data
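The first step, reading the data from a URL, could be sketched as below. This is a minimal sketch rather than the code used in the demo: the helper name readUrl and the address in datasetUrl are assumptions made for illustration, and the placeholder must be replaced with the real address of the dataset.

```scala
import scala.io.Source

// Hypothetical helper: returns the text behind a URL as a single String.
def readUrl(url: String): String = {
  val source = Source.fromURL(url, "UTF-8")
  try source.mkString finally source.close()
}

// Placeholder address (an assumption): replace it with the dataset's real URL.
val datasetUrl = "https://example.org/baby-names/rows.json"

// Evaluated lazily, so nothing is downloaded until jsonString is first used.
lazy val jsonString: String = readUrl(datasetUrl)
```

Reading the whole response into one String is reasonable for a dataset that fits in driver memory; larger files are better saved to storage first and read from there with Spark.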
Next, we take the jsonString val, which holds the JSON text read from the URL, and create a
dataframe from it using the Spark API. We need to import spark.implicits._ to convert a
Seq of Strings to a Dataset, and then we create a dataframe out of that Dataset.
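A minimal sketch of this step follows. The sample JSON is made up and much smaller than the real dataset, and the explicit SparkSession setup is an assumption added so that the snippet also runs outside a notebook (Databricks already provides spark).

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate reuses the notebook's session where one exists and
// starts a local session elsewhere.
val spark = SparkSession.builder().appName("baby-names").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in for the downloaded text (made-up sample, not the real dataset).
val jsonString =
  """{"meta": {"view": {"name": "sample"}}, "data": [["row-1", "2013", "EMMA"], ["row-2", "2013", "LIAM"]]}"""

// Seq[String] -> Dataset[String] (this needs spark.implicits._), then parse it as JSON.
val jsonDF = spark.read.json(Seq(jsonString).toDS())
```

The whole document becomes a single row here, with data and meta as its two top-level columns.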
Now let us see the schema of the JSON using the printSchema method:
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
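A schema like the one above could be printed with a call along these lines. This is a sketch on a made-up sample that contains only the data field, so the metadata part of the real schema does not appear in it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, StringType}

val spark = SparkSession.builder().appName("baby-names").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up sample with the same nesting as the dataset: an array of string arrays.
val jsonString = """{"data": [["row-1", "2013", "EMMA"], ["row-2", "2013", "LIAM"]]}"""
val jsonDF = spark.read.json(Seq(jsonString).toDS())

// Prints the inferred schema as an indented tree.
jsonDF.printSchema()

// The same information is available programmatically.
val dataType = jsonDF.schema("data").dataType
```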
The JSON also contains metadata about the data. Let’s not worry about it for now, but you can
have a look at it when you run this on your own machine. The metadata mainly holds the column
(field) definitions, which I have extracted for you to give a better understanding of
the data we will work on.
Each element of the data array contains the fields below, which we are going to analyze.
╺ meta
╺ Year
╺ first_name
╺ County
╺ Sex
╺ Count
╺ Sid
╺ Id
╺ Position
╺ created_at
╺ created_meta
╺ updated_at
╺ updated_meta
But how can we extract these data fields from the JSON? First, let’s select data from the jsonDF
dataframe we created to see what it looks like.
Now we have to extract the fields within this data. To do this, let us first create a temporary
view of this dataframe and use the explode function to extract the Year, Name, County, and gender
fields.
To use the explode function, we should first import the Spark SQL functions.
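The extraction could look roughly like the sketch below. The view name jsonData, the sample rows, and above all the array positions are assumptions: the sample uses a simplified layout, so the indexes passed to getItem would need to be adjusted to where each field sits in the real dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("baby-names").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up sample in an assumed layout: [id, year, first_name, county, sex, count].
val jsonString =
  """{"data": [["row-1", "2013", "EMMA", "KINGS", "F", "10"], ["row-2", "2013", "LIAM", "KINGS", "M", "12"]]}"""
val jsonDF = spark.read.json(Seq(jsonString).toDS())

// Expose the nested array through a temporary view.
jsonDF.select("data").createOrReplaceTempView("jsonData")

// explode turns each inner array into its own row; getItem then picks
// fields by position (positions follow the sample layout above).
val insightData = spark.table("jsonData")
  .select(explode(col("data")).as("record"))
  .select(
    col("record").getItem(1).as("year"),
    col("record").getItem(2).as("name"),
    col("record").getItem(3).as("county"),
    col("record").getItem(4).as("sex"))
```

After this step, insightData has one row per record instead of a single row holding the whole array.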
Let me show you the contents of the insightData dataframe using the display method
available in Databricks.
Now let us write a query to find the most popular first letter of baby names
in each year.
// Register only the columns the query needs as a temporary view.
insightData.select("year", "name").createOrReplaceTempView("yearname")
// Count names per year and first letter, rank the letters within each year, keep rank 1.
val dis = spark.sql("""
  select year, firstLetter, count, ranks from (
    select year, firstLetter, count, rank() over (partition by year order by count desc) as ranks from (
      select year, left(name, 1) as firstLetter, count(1) as count from yearname
      group by year, firstLetter order by year desc, count desc) Y
  ) Z where ranks = 1 order by year desc""")
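For comparison, the same ranking can be expressed with the DataFrame API. The sketch below is an alternative formulation, not the query used in the demo; it runs on a handful of made-up rows, and the names yearName, letterCounts, byYear, and topLetters are chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, rank, substring}

val spark = SparkSession.builder().appName("baby-names").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up rows standing in for insightData.select("year", "name").
val yearName = Seq(
  ("2013", "EMMA"), ("2013", "ELLA"), ("2013", "LIAM"),
  ("2012", "MIA"), ("2012", "MASON"), ("2012", "AVA")
).toDF("year", "name")

// Number of names per year and first letter.
val letterCounts = yearName
  .groupBy(col("year"), substring(col("name"), 1, 1).as("firstLetter"))
  .agg(count(lit(1)).as("count"))

// Rank the letters within each year, most frequent first.
val byYear = Window.partitionBy("year").orderBy(col("count").desc)

// Keep the top-ranked letter(s) per year; ties share rank 1.
val topLetters = letterCounts
  .withColumn("ranks", rank().over(byYear))
  .filter(col("ranks") === 1)
  .orderBy(col("year").desc)
```

In a Databricks notebook, passing the result to display lets you switch from the table to a chart of the top letters by year.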
Apache Spark Integration Services
With 15+ years in data analytics technology services,
the experts at Aegis Softwares Canada offer a wide range of
Apache Spark implementation, integration, and
development solutions, along with 24/7 support.
AEGIS SOFTWARE
CANADA (Branch Office)
2 Robert Speck Parkway,
Suite 750, Mississauga,
Ontario L4Z 1H8,
Canada.
OFFSHORE SOFTWARE DEVELOPMENT COMPANY
INDIA (Head Office)
319, 3rd Floor, Golden Plaza,
Tagore Road,
Rajkot – 360001
Gujarat, India
info@aegissoftwares.com www.aegissoftwares.com