DESIGN DATA ARCHITECTURE
• In the contemporary data-driven landscape, a well-
designed data architecture is the bedrock of
successful data analytics.
• It serves as the blueprint for an organization's data
infrastructure, dictating how data is collected,
stored, processed, and consumed to drive insights
and inform decision-making.
• A robust data architecture ensures data quality,
security, and accessibility, enabling businesses to
unlock the full potential of their data assets.
Core Principles of Data Architecture Design
• Scalability and Flexibility: The architecture must be able to handle growing volumes
of data and adapt to evolving business requirements and new technologies. This
includes the ability to scale both horizontally (adding more machines) and vertically
(increasing the resources of existing machines).
• Data as a Shared Asset: Viewing data as a shared enterprise resource breaks down
silos and provides a holistic view of the business, leading to improved efficiency and
decision-making.
• Security and Governance: Implementing robust security measures, such as
encryption and access controls, is paramount to protect sensitive data. A strong
data governance framework ensures data quality, integrity, and compliance with
regulations.
• Alignment with Business Goals: The data architecture should be driven by business
needs, not just technological preferences. Understanding the organization's
strategic goals is the first step in designing an effective data architecture.
• Simplicity and Clarity: A well-designed architecture should be easy to understand
and manage. This promotes collaboration and reduces complexity.
Key Components of a Modern Data Architecture
• Data Sources: This includes a wide variety of sources from which data is
collected, such as databases, applications, IoT devices, and social media
platforms.
• Data Ingestion: This is the process of bringing data from various sources into a
central storage system. This can be done in batches or in real-time.
• Data Storage: The choice of data storage is a critical architectural decision. The
main options include:
– Data Warehouse: A central repository of structured, filtered data that has been
processed for a specific purpose. Data warehouses are optimized for complex queries
and business intelligence (BI) and are ideal for storing historical data.
– Data Lake: A vast storage repository that holds large amounts of raw data in its native
format. Data lakes are highly flexible and cost-effective, making them suitable for storing
structured, semi-structured, and unstructured data.
– Data Lakehouse: A hybrid approach that combines the flexibility and low-cost storage of
a data lake with the data management and structuring features of a data warehouse.
This architecture supports both BI and machine learning workloads on a single platform.
• Data Processing: This involves transforming the raw data into a
usable format. The two primary methods are (a minimal ELT sketch follows this list):
– ETL (Extract, Transform, Load): Data is extracted from the source,
transformed in a separate staging area, and then loaded into the target
data warehouse. This traditional approach is well-suited for structured
data and ensures data quality before it enters the analytical
environment.
– ELT (Extract, Load, Transform): Raw data is first loaded into the target
system (typically a data lake or data lakehouse), and then transformed
as needed. This modern approach is more flexible, faster for loading,
and better suited for handling large volumes of unstructured data.
• Data Consumption: This is how end-users, such as data analysts
and data scientists, access and analyze the data. This can be
through BI dashboards, reporting tools, or data science notebooks.
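To make the ELT pattern above concrete, here is a minimal sketch that loads raw rows into a local SQLite table first and only transforms them afterwards with pandas; the file name, table names, and columns (order_date, amount) are assumptions for illustration.
```python
# Minimal ELT sketch (assumed file, table, and column names).
# Raw data is loaded first, then transformed inside the analytical store.
import sqlite3
import pandas as pd

conn = sqlite3.connect("analytics.db")

# Extract + Load: push raw, untransformed rows into a landing table.
raw = pd.read_csv("sales_raw.csv")                       # assumed source file
raw.to_sql("sales_raw", conn, if_exists="replace", index=False)

# Transform: clean and aggregate only when the data is needed downstream.
clean = pd.read_sql("SELECT * FROM sales_raw", conn)
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
clean = clean.dropna(subset=["order_date", "amount"])
monthly = (clean
           .groupby(clean["order_date"].dt.to_period("M").astype(str))["amount"]
           .sum()
           .rename("monthly_revenue"))
monthly.to_frame().to_sql("monthly_revenue", conn, if_exists="replace")
```
In an ETL pipeline the cleaning and aggregation steps would instead run in a staging area before anything reaches the warehouse.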
Purpose            | Tools/Technologies
ETL/ELT            | Apache NiFi, Talend, AWS Glue, Airflow
Storage            | SQL DBs, NoSQL (MongoDB), S3, Delta Lake
Processing         | Apache Spark, Databricks, Presto, Flink
BI & Visualization | Tableau, Power BI, Looker, Superset
Governance         | Collibra, Alation, Apache Atlas, Azure Purview
Designing Your Data Architecture: A Step-by-Step Approach
• Understand Business Requirements: Begin by thoroughly understanding the organization's
goals, key performance indicators (KPIs), and the specific questions that data analytics needs
to answer.
• Identify and Profile Data Sources: Catalog all potential data sources, both internal and
external. Analyze the volume, velocity, and variety of the data from each source.
• Choose the Right Storage Solution: Based on your data characteristics and analytical needs,
select the most appropriate storage architecture (Data Warehouse, Data Lake, or Data
Lakehouse).
• Select a Data Processing Method: Decide whether an ETL or ELT approach is more suitable
for your data integration needs.
• Determine the Architectural Style: Choose an architectural style (e.g., Lambda, Kappa, Data
Mesh) that aligns with your real-time processing requirements and organizational structure.
• Implement Data Governance and Security: Establish clear policies for data quality, access
control, and compliance from the outset.
• Select Appropriate Technologies: Choose the right tools and platforms for each component
of your architecture, considering factors like cost, scalability, and ease of use.
• Design for Consumption: Ensure that the architecture provides intuitive and efficient ways
for users to access and analyze the data.
• Iterate and Evolve: A data architecture is not static. It should be continuously monitored,
evaluated, and adapted to meet changing business needs and technological advancements.
A traditional data warehouse
• A traditional data warehouse is a comprehensive system
that brings together data from different sources within
an organization. Its primary role is to act as a centralized
data repository used for analytical and reporting
purposes.
• Traditional warehouses are physically situated on-
site within your business premises. So, you have to take
care of acquiring the necessary hardware, setting up
server rooms, and employing staff to manage the system.
These types of data warehouses are sometimes called
on-premises, on-prem, or on-premise data warehouses.
Architectural Styles
• Lambda Architecture: This is a hybrid approach that
combines batch and real-time processing to provide both
comprehensive historical views and low-latency real-time
insights. It consists of a batch layer for processing all data, a
speed layer for processing real-time data, and a serving
layer to merge the results.
• Kappa Architecture: This is a simpler alternative to the
Lambda architecture that handles all data processing as a
single stream. It eliminates the complexity of maintaining
separate batch and real-time pipelines, making it a good
choice for use cases where real-time analytics is the primary
focus.
https://medium.com/@FrankAdams7/what-is-the-difference-between-lambda-and-kappa-architectures-3806be298089
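As a toy illustration of the Lambda idea above, the sketch below merges a precomputed batch view with a small speed-layer view at query time; the page names and counts are invented.
```python
# Toy Lambda-style serving layer: merge a (stale) batch view with a
# real-time speed-layer view at query time. All values are made up.
from collections import Counter

batch_view = Counter({"page_a": 10_000, "page_b": 7_500})   # recomputed nightly by the batch layer
speed_view = Counter({"page_a": 42, "page_c": 3})           # events since the last batch run

def query(page: str) -> int:
    """Serving layer: combine both views to answer a query."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # 10042 -> batch history plus fresh real-time events
```
A Kappa-style system would instead keep a single streaming pipeline and recompute views by replaying the stream.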
Constraints and Influences
• Various constraints and influences will have an
effect on data architecture design. These include
enterprise requirements, technology drivers,
economics, business policies and data processing
needs.
• Constraints and influences on data architecture are
the key factors that shape how a data architecture
is designed, implemented, and managed. These
affect the scalability, performance, cost, security,
and overall success of any data analytics system.
Business Constraints
Constraint              | Description
Budget                  | Limits the choice of tools (e.g., open-source vs. enterprise).
Time-to-market          | Fast delivery may limit architectural complexity.
Business Goals          | Architecture must align with KPIs and strategic goals.
Regulatory Requirements | GDPR, HIPAA, etc., may influence how data is stored and accessed.
Legal & Compliance Constraints
Constraint           | Description
Data Residency Laws  | Data must be stored within certain geographies (e.g., India, EU).
Compliance Standards | ISO, SOC2, HIPAA, etc., require data traceability and secure access.
Auditability         | Must ensure logs, lineage, and documentation for compliance.
Organizational Influences
Influence                | Description
Skills & Expertise       | Availability of skilled personnel in Python, Spark, DBMS, cloud tools.
Company Culture          | Data-driven culture influences emphasis on governance and quality.
Stakeholder Expectations | Needs of analysts, data scientists, and business users must be balanced.
Collaboration Models     | Teams (centralized vs. federated) affect how data is governed and accessed.
Technical Constraints
Constraint           | Description
Data Volume          | Impacts storage choices (e.g., data lake for petabytes).
Data Variety         | Structured, semi-structured, and unstructured data need different models.
Data Velocity        | Real-time vs. batch processing influences tool selection (e.g., Kafka vs. Hadoop).
System Compatibility | Must integrate with legacy systems, APIs, and platforms.
Latency Requirements | Low-latency needs real-time architecture (e.g., Kappa, in-memory DBs).
Security & Privacy   | Architecture must support encryption, masking, RBAC, etc.
V        | Description                   | Key Tools/Techniques
Volume   | Scale of data                 | HDFS, Hadoop, S3
Velocity | Speed of incoming data        | Kafka, Spark, Flink
Variety  | Different formats and sources | MongoDB, data lakes
Veracity | Trustworthiness and quality   | Data cleaning, validation rules
Aspect     | Data Analysis | Data Analytics
Definition | The process of examining, cleaning, transforming, and modeling data to discover useful information. | A broader field that includes data analysis as one part and focuses on using data to drive decisions and predictions.
Scope      | Narrower: focuses mainly on understanding past data. | Broader: includes analysis, forecasting, and decision-making.
Goal       | To extract insights and describe trends or patterns. | To solve business problems and guide future actions.
Focus      | "What happened?" and "Why did it happen?" | "What will happen?" and "What should we do?"
Types      | Descriptive and diagnostic analysis | Descriptive, diagnostic, predictive, and prescriptive analytics
Tools Used | Excel, R, Python (Pandas), SQL | Python, R, Tableau, Power BI, SAS, machine learning tools
Users      | Data analysts, researchers | Data scientists, business analysts, decision-makers
Types of Data Analytics
• Descriptive Analytics: "What happened?"
• Descriptive analytics is the most common and foundational type of data analysis.
It focuses on summarizing historical data to provide a clear and understandable
picture of past events. This type of analysis is often presented in the form of
dashboards, reports, charts, and tables, which help in tracking Key Performance
Indicators (KPIs).
• Purpose: To provide a snapshot of what has already occurred.
• Examples:
– Monthly revenue reports.
– Sales lead overviews.
– Tracking website traffic and engagement metrics.
• What it does: Describes what has happened in the past.
• Purpose: Summarize historical data to understand patterns and trends.
• Example: Monthly sales reports, website traffic reports.
• Tools: Excel, Tableau, Power BI, SQL.
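A minimal descriptive-analytics sketch in pandas, matching the monthly-report examples above; the file sales.csv and its columns (order_date, amount, order_id) are assumptions.
```python
# Descriptive analytics sketch: "what happened?" -- a monthly revenue summary.
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["order_date"])   # assumed file and columns
monthly_report = (sales
                  .groupby(sales["order_date"].dt.to_period("M"))
                  .agg(revenue=("amount", "sum"),               # total revenue per month
                       orders=("order_id", "count")))           # number of orders per month
print(monthly_report)
```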
• Diagnostic Analytics: "Why did it happen?"
• Once descriptive analytics has shown what happened, diagnostic analytics delves
deeper to understand the root causes behind those events. It involves examining data
more closely to identify anomalies and uncover the factors that contributed to a
particular outcome. This often requires looking at related data sources and historical
patterns.
• Purpose: To identify the reasons behind past performance.
• Examples:
– Investigating a sudden drop in sales by analyzing customer feedback and competitor activity for
that period.
– Determining why a marketing campaign underperformed by examining the target audience's
demographics and engagement with the ad creatives.
• What it does: Explains why something happened.
• Purpose: Identify the root causes of outcomes.
• Example: Sales dropped – diagnostic analysis may show it was due to a pricing change
or competitor activity.
• Techniques: Drill-down, data mining, correlation analysis.
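A small sketch of the correlation-analysis and drill-down techniques listed above, checking whether a price change moved with units sold; weekly_sales.csv and its columns are assumptions.
```python
# Diagnostic analytics sketch: "why did it happen?" -- simple correlation check.
import pandas as pd

# Assumed columns: price, units_sold, competitor_promo, price_change_flag
df = pd.read_csv("weekly_sales.csv")

# Correlation analysis: how strongly do the candidate drivers move together?
print(df[["price", "units_sold", "competitor_promo"]].corr())

# Drill-down: compare average units sold before and after the price change.
print(df.groupby("price_change_flag")["units_sold"].mean())
```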
• Predictive Analytics: "What is likely to happen?"
• Predictive analytics uses historical data to forecast future events. By identifying
patterns and trends from the past, data analysts can create statistical models to
estimate the likelihood of a future outcome. This type of analysis helps businesses
to anticipate future trends and prepare accordingly.
• Purpose: To forecast future outcomes and trends based on historical data.
• Examples:
– Sales forecasting to predict future revenue.
– Risk assessment to identify potential threats.
– Using customer data to predict which customers are likely to churn.
• What it does: Predicts what is likely to happen in the future.
• Purpose: Use past data and trends to forecast outcomes.
• Example: Predicting customer churn, forecasting sales for next quarter.
• Techniques: Machine learning, statistical modeling, regression.
• Tools: Python (Scikit-learn), R, SAS, IBM SPSS.
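A hedged sketch of the churn-prediction example above using scikit-learn logistic regression; the customers.csv file and its feature columns are assumptions.
```python
# Predictive analytics sketch: estimating churn probability with scikit-learn.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("customers.csv")                         # assumed file
X = data[["tenure_months", "monthly_spend", "support_tickets"]]   # assumed features
y = data["churned"]                                         # assumed 0/1 label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Hold-out accuracy:", model.score(X_test, y_test))
print("Churn probability (first 5 customers):", model.predict_proba(X_test)[:5, 1])
```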
• Prescriptive Analytics: "What should we do about it?"
• Prescriptive analytics takes the insights from all the previous stages and recommends
specific actions to take to achieve a desired outcome. It goes beyond predicting what will
happen by suggesting the best course of action. This type of analysis often utilizes
machine learning and optimization algorithms to provide data-driven recommendations.
• Purpose: To recommend actions to optimize for a specific goal.
• Examples:
– Suggesting the optimal pricing for a new product to maximize profit.
– Recommending the best marketing channels to invest in to acquire the most valuable customers.
– Providing a production plan to meet forecasted demand while minimizing costs.
• What it does: Suggests actions to take for optimal results.
• Purpose: Recommend the best course of action.
• Example: Recommending inventory reorder points, dynamic pricing.
• Techniques: Optimization, decision trees, simulation, AI.
• Tools: IBM Decision Optimization, Gurobi, Python (SciPy, Optuna).
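A small prescriptive sketch in the spirit of the production-planning example above, using linear programming via scipy.optimize.linprog; all costs, capacities, and demands are invented.
```python
# Prescriptive analytics sketch: choose production quantities of two products
# to minimize cost while meeting forecast demand. All numbers are invented.
from scipy.optimize import linprog

# Decision variables: x = [units_product_1, units_product_2]
cost = [4.0, 6.0]                 # cost per unit -> objective: minimize cost @ x

# Constraints in "A_ub @ x <= b_ub" form:
#   machine hours: 2*x1 + 3*x2 <= 1200
#   meet demand:   x1 >= 200  ->  -x1 <= -200
#                  x2 >= 150  ->  -x2 <= -150
A_ub = [[2, 3], [-1, 0], [0, -1]]
b_ub = [1200, -200, -150]

result = linprog(c=cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, result.fun)       # recommended plan and its total cost
```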
Type         | Focus           | Question Answered  | Techniques/Tools
Descriptive  | Past events     | What happened?     | Dashboards, reports, BI tools
Diagnostic   | Root cause      | Why did it happen? | Drill-downs, data mining
Predictive   | Future trends   | What will happen?  | ML models, forecasting
Prescriptive | Decision-making | What should we do? | Optimization, AI, simulations
Manage the Data for Analysis
• Effective data management for analytics can be broken down into several key
components:
• Data Governance: This forms the blueprint for managing data responsibly. It
involves establishing clear policies, procedures, and roles (like data stewards and
owners) to ensure data is handled consistently and in alignment with
organizational goals. A flexible data governance framework allows for adjustments
as business needs evolve.
• Data Collection and Integration: The first step is to gather data from various
relevant sources. This data can be structured or unstructured. Data integration is
crucial for creating a unified view of information from multiple sources, breaking
down data silos that can hinder analysis.
• Data Preparation and Cleaning: Raw data is often messy and needs to be
prepared for analysis. This involves cleaning the data to ensure accuracy,
transforming it into a suitable format, and handling issues like missing values and
inconsistencies. In fact, data scientists can spend up to 80% of their time on data
preparation alone (a minimal cleaning sketch follows this list).
• Data Storage and Security: A robust data storage strategy is essential.
As data volume grows, scalable solutions like cloud-based platforms
(e.g., Amazon S3, Google BigQuery, Microsoft Azure) become necessary
to handle large datasets efficiently. A common storage methodology is
the 3-2-1 rule: keep three copies of your data on two different types of
storage, with one copy located offsite. Equally important is data
security, which involves implementing measures to protect sensitive
information from unauthorized access and breaches.
• Metadata Management: Metadata is "data about data," providing
context and making it easier to understand, find, and use datasets. It
includes information about the data's content, structure, and
permissions. Effective metadata management promotes collaboration
and ensures the long-term usability of your data.
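A minimal sketch of the Data Preparation and Cleaning step referenced above, using pandas; raw_customers.csv and its columns are assumptions.
```python
# Data preparation sketch: handle duplicates, missing values, and
# inconsistent formats before analysis. File and column names are assumed.
import pandas as pd

df = pd.read_csv("raw_customers.csv")

df = df.drop_duplicates()                                   # remove exact duplicates
df["email"] = df["email"].str.strip().str.lower()           # normalize text fields
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["age"] = df["age"].fillna(df["age"].median())            # impute missing values
df = df.dropna(subset=["customer_id", "signup_date"])       # drop unusable rows

df.to_parquet("customers_clean.parquet", index=False)       # needs pyarrow or fastparquet
```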
Understand Various Sources of Data
• Data can be generated from two types of sources, namely primary and secondary.
Sources of Primary Data
• The sources of generating primary data are:
• Observation Method
• Survey Method
• Experimental Method
Observation Method
• An observation is a data collection method by which you gather
knowledge of the researched phenomenon through making
observations of it as and when it occurs.
• The main aim is to focus on observations of human behavior, the use
of the phenomenon and human interactions related to the
phenomenon.
• We can also make observations on verbal and nonverbal
expressions. In making and documenting observations, we need to
clearly differentiate our own observations from the observations
provided to us by other people.
• A range of data storage genres found in archives and collections is
suitable for documenting observations, e.g. audio, visual, textual and
digital, including sub-genres such as note-taking, audio recording and
video recording.
• We make observations from either the outsider or insider point of view in
relation to the researched phenomenon and the observation technique
can be structured or unstructured.
• The degree of the outsider or insider points of view can be seen as a
movable point in a continuum between the extremes of outsider and
insider.
• If you decide to take the insider point of view, you will be a participant
observer in situ and actively participate in the observed situation or
community. The activity of a Participant observer in situ is called field work.
• This observation technique has traditionally belonged to the data
collection methods of ethnology and anthropology. If you decide to take
the outsider point of view, you try to distance yourself from your
own cultural ties and observe the researched community as an outside
observer.
Survey Method
• The survey method is one of the most widely used
techniques for collecting primary data in data analytics,
especially for gathering opinions, attitudes, preferences,
and behaviors from a defined population.
• Key Features of the Survey Method
• Collects structured data directly from people.
• Uses a predefined questionnaire or interview.
• Can be quantitative (closed-ended) or qualitative (open-
ended).
• Can be conducted online, offline, face-to-face, or via
phone.
Type              | Description                                | Example Use Case
Online Surveys    | Done through forms on websites/apps.       | Customer satisfaction, market research
Face-to-Face      | Interviewer meets respondents in person.   | Field research, political polling
Telephone Surveys | Conducted via phone calls.                 | Service feedback, quick opinion polls
Mail Surveys      | Sent and returned through postal services. | Large-scale governmental surveys
• Steps in Conducting a Survey
• Define Objective
– What information are you trying to gather?
• Design the Questionnaire
– Choose the right mix of open-ended and closed-ended questions.
– Keep it clear, neutral, and concise.
• Choose the Sample
– Identify your target population.
– Select a representative sample (random, stratified, etc.).
• Distribute the Survey
– Online, phone, email, in-person, etc.
• Collect and Organize Data
– Store responses in databases, spreadsheets, or survey tools.
• Analyze Results
– Use tools like Excel, R, Python, or visualization tools (Tableau, Power BI).
• Interpret and Report Findings
– Convert insights into charts, graphs, and actionable recommendations.
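A tiny sketch of the "Analyze Results" step above: tabulating a closed-ended question and summarizing a 1-5 rating with pandas; survey_responses.csv and its columns are assumptions.
```python
# Survey analysis sketch: frequency counts for a closed-ended question and
# a summary of a 1-5 satisfaction rating. File and column names are assumed.
import pandas as pd

responses = pd.read_csv("survey_responses.csv")

print(responses["preferred_channel"].value_counts(normalize=True))  # share per option
print(responses["satisfaction"].describe())                         # mean, spread, quartiles
print(responses.groupby("age_group")["satisfaction"].mean())        # simple drill-down view
```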
Experimental Method
• CRD - Completely Randomized Design:
• A completely randomized design (CRD) is one where the
treatments are assigned completely at random so that each
experimental unit has the same chance of receiving any one
treatment.
• For the CRD, any difference among experimental units receiving
the same treatment is considered as experimental error. Hence,
CRD is appropriate only for experiments with homogeneous
experimental units, such as laboratory experiments, where
environmental effects are relatively easy to control.
• For field experiments, where there is generally large variation
among experimental plots in environmental factors such as soil, the
CRD is rarely used. In agriculture, the CRD is therefore used mainly in
laboratory and greenhouse studies rather than in the field.
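As a small illustration of the CRD described above, the sketch below assigns treatments completely at random to homogeneous experimental units; the 4 treatments and 20 units are invented.
```python
# Completely Randomized Design sketch: each experimental unit has the same
# chance of receiving any treatment (4 treatments x 5 replicates assumed).
import random

treatments = ["A", "B", "C", "D"]
units = list(range(20))                  # 20 homogeneous experimental units

assignment = treatments * 5              # 5 replicates per treatment
random.shuffle(assignment)               # completely random allocation

for unit, treatment in zip(units, assignment):
    print(f"unit {unit:2d} -> treatment {treatment}")
```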
• A Randomized Block Design (RBD):
• In an RBD, the experimenter divides subjects into subgroups
called blocks, such that the variability within blocks is
less than the variability between blocks. Then,
subjects within each block are randomly assigned to
treatment conditions.
• Compared to a completely randomized design, this
design reduces variability within treatment
conditions and potential confounding, producing a
better estimate of treatment effects.
• What is Latin Square Design (LSD)?
• Latin Square Design is an experimental design used when there
are two sources of variability (also called blocking factors)
besides the treatment. It helps to control the variation from both
sources by arranging the experiment in a square layout.
• Key Characteristics:
• Number of treatments = number of rows = number of columns
(a square).
• Each treatment appears exactly once in each row and each
column.
• Controls variation in two directions (e.g., row-wise and column-
wise).
Factorial Design
• A Factorial Design is an experimental setup used to study the effects of
two or more factors (independent variables), each at multiple levels,
on a response variable. It also allows you to study interactions between
the factors.
• Purpose:
• Analyze main effects of each factor.
• Analyze interaction effects between factors.
• Gain more insight from fewer experiments than testing each factor
independently.
• Structure:
• Factors (e.g., Fertilizer Type, Watering Level)
• Each factor has levels (e.g., Fertilizer A/B/C; Water: Low/High)
• All possible combinations of factor levels are tested.
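A short sketch enumerating every treatment combination for the two-factor example above (Fertilizer A/B/C and Water Low/High) with itertools.product.
```python
# Full factorial design sketch: every combination of factor levels is run.
from itertools import product

fertilizer = ["A", "B", "C"]        # factor 1, three levels
water = ["Low", "High"]             # factor 2, two levels

design = list(product(fertilizer, water))   # 3 x 2 = 6 treatment combinations
for run, (f, w) in enumerate(design, start=1):
    print(f"run {run}: fertilizer={f}, water={w}")
```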
• External Data
• External data is any information that originates from outside the organization. This
type of data offers a broader perspective on the market, industry trends, and the
competitive landscape.
• Examples of External Data Sources:
• Government Data: Publicly available datasets from government agencies on topics like
demographics, economics, and public health.
• Market Research Firms: Companies that specialize in collecting and selling data on
consumer behavior, market trends, and industry benchmarks.
• Social Media: Platforms like Twitter and Facebook provide vast amounts of user-
generated content, including opinions, trends, and sentiment data.
• Academic Research: Published studies, articles, and dissertations from various
academic fields can offer in-depth analysis and findings.
• Third-Party Data Providers and APIs: Many companies provide data through APIs
(Application Programming Interfaces) on a wide range of subjects, from weather to
financial markets.
• Examples of Internal Data Sources:
• Sales and Financial Data: This includes revenue figures, sales records, pricing
information, and accounting records. This data is crucial for analyzing profitability
and market performance.
• Customer Relationship Management (CRM) Systems: These systems store a
wealth of information about customers, including their contact details, purchase
history, and interactions with the company.
• Enterprise Resource Planning (ERP) Systems: ERP systems integrate various
business processes and store data related to production, inventory, and human
resources.
• Website and E-commerce Analytics: Data from a company's website, such as
traffic, user behavior, and online transactions, provides insights into customer
engagement.
• Machine-Generated Data: This includes data from sensors on manufacturing
equipment or logs from computer systems, which can be used for operational
efficiency and monitoring.
DATA FROM SENSORS
• Sensor data is at the core of modern IoT (Internet of Things), smart devices,
and industrial systems. Understanding how to collect, process, and analyze this
data is crucial in many fields like agriculture, healthcare, manufacturing, and
transportation. Sensor data refers to the continuous or periodic measurements
collected by physical sensing devices to monitor real-world conditions such as:
• Temperature
• Humidity
• Pressure
• Light intensity
• Motion
• Sound
• Location (GPS)
• Acceleration (via accelerometers)
• Environmental gas levels (CO₂, NOx, etc.)
Characteristics of Sensor Data
Feature     | Description
Time-Series | Data is usually collected over time (e.g., every second/minute/hour).
Real-Time   | Often needs to be processed or monitored live.
High Volume | Can generate large datasets quickly, especially with multiple sensors.
Noisy       | Can contain outliers or faulty readings due to environment or hardware.
Streaming   | Frequently comes as a data stream, not static files.
Steps to Work with Sensor Data
• Data Acquisition
• Collected via:
– Microcontrollers (e.g., Arduino, Raspberry Pi)
– IoT platforms (e.g., AWS IoT, Azure IoT, Google Cloud IoT)
– Gateways for industrial sensors
• Data Transmission
• Using protocols like:
– MQTT, HTTP, CoAP
– Sent to cloud platforms, databases, or edge devices
• Data Storage
• Stored in:
– Time-series databases (e.g., InfluxDB, TimescaleDB)
– Cloud storage (AWS S3, Google Cloud)
– Data lakes/warehouses
• Data Processing
• Real-time or batch:
– Filtering: Remove noise or faulty values
– Smoothing: Use moving averages or median filters
– Normalization: Bring different sensor values into comparable ranges
• Data Analysis
• Use analytics to extract insights:
– Trends, thresholds, predictive models
– Anomaly detection (e.g., sudden temperature spikes)
• Tools: Python (Pandas, NumPy), MATLAB, R, Spark
• Visualization
• Tools: Grafana, Power BI, Tableau, Kibana
• Useful for dashboards, alerts, and performance monitoring
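A compact sketch of the Data Processing and Data Analysis steps above: smoothing a noisy temperature stream, normalizing it, and flagging sudden spikes; temperature.csv and its columns are assumptions.
```python
# Sensor data sketch: smoothing, normalization, and a simple anomaly flag
# on a temperature time series. File and column names are assumed.
import pandas as pd

readings = pd.read_csv("temperature.csv", parse_dates=["timestamp"])
readings = readings.set_index("timestamp").sort_index()

readings["smooth"] = readings["temp_c"].rolling("5min").median()   # smooth out noise
mean, std = readings["smooth"].mean(), readings["smooth"].std()
readings["zscore"] = (readings["smooth"] - mean) / std             # normalize
readings["anomaly"] = readings["zscore"].abs() > 3                 # flag sudden spikes

print(readings[readings["anomaly"]].head())
```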
• At its core, a signal is any observable change or
pattern in data that conveys meaningful information
over time or space. Examples include:
• Things like voltage versus time (electronics)
• Prices over time (finance)
• Web activity or engagement metrics (digital
marketing)
A signal doesn't necessarily have built-in meaning,
but it becomes meaningful when interpreted within
the right context.
Domain                | Signal Type                   | Purpose                                     | Tools / Techniques
Engineering           | Electrical, acoustic, digital | Analyze system behavior, design and testing | Oscilloscopes, spectrum analyzers, ICA, sampling
Business/Marketing    | Intent or business trigger    | Predict buyer behavior or market changes    | Web analytics, CRM signals, third-party intent data
Security/Intelligence | Emitted digital signals       | Detect threats, monitor anomalies           | Signal intercept tools, classification, real-time alerts
• Why Signals Matter
• Early detection: Signals can provide early warning, whether it is a
drop in signal-to-noise ratio that degrades performance or a change
in executive leadership that signals a strategic shift.
• Decision support: Interpreting signals correctly allows
better, more timely decision-making.
• Noise filtering: Especially with weak signals,
separating meaningful insights from noise is difficult
but valuable in forecasting and strategy.
Storing Signals
• Analog to Digital Conversion
• Signal conditioning: Before digitizing, analog
signals are amplified, filtered, scaled, and
noise-reduced to match ADC requirements (e.g.
anti-aliasing, voltage range).
• Sampling (ADC): Use an analog-to-digital converter
to capture signals as discrete-time, digital samples.
The sampling rate must follow the Nyquist
criterion: at least twice the highest frequency of
interest, to avoid aliasing.
• Digital Signal Storage Formats
• Raw arrays: Float or integer sequences representing
sample values, often stored as binary data.
• Binary formats: WAV/WavPack, Parquet, or HDF5,
especially for large datasets, time-series analysis, or
big-data processing.
• Compression-based formats: WavPack for audio; delta
encoding, Huffman coding, and JPEG/MPEG for
images/video; or Parquet for structured columnar
storage with compression and schema.
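As a hedged illustration of the storage options above, this sketch samples a 5 Hz sine wave well above its Nyquist rate and stores it both as a raw binary NumPy array and as a compressed columnar Parquet file (writing Parquet via pandas assumes pyarrow or fastparquet is installed).
```python
# Signal storage sketch: sample a sine wave and persist it two ways.
import numpy as np
import pandas as pd

fs = 100.0                                   # sampling rate (Hz), > 2 x 5 Hz (Nyquist)
t = np.arange(0, 2.0, 1.0 / fs)              # 2 seconds of samples
signal = np.sin(2 * np.pi * 5 * t)           # 5 Hz sine wave

np.save("signal_raw.npy", signal)            # raw binary array of float samples

df = pd.DataFrame({"t": t, "value": signal})
df.to_parquet("signal.parquet", index=False) # columnar + compressed storage
```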
Storage ↔ Processing Strategies Comparison
Scenario                         | Data Storage Approach                             | Processing Strategy
Real-time embedded systems       | Circular buffers or memory arrays in DSP platform | On-the-fly filtering, FFT, adaptive filters
Offline analytics / ML pipelines | Chunked file formats (Parquet, HDF5, WAV, CSV)    | Batch processing with Spark, SciPy, MATLAB, TensorFlow
Large-scale time-series storage  | Columnar + compressed (Parquet, Parquet on HDFS)  | SQL-like queries, UDFs, Spark jobs
Audio/speech signal storage      | WAV/WavPack or compressed audio formats           | MFCC, LPC, DTW, speech recognition pipelines
• Typical Pipeline
• Acquisition & Conditioning: Raw signal → amplifier, filter → ADC → digital
samples.
• Preprocessing: Noise removal, normalization, artifact suppression.
• Frequency-domain transforms:
– FFT/DFT to convert time-series into frequency components
– Time–frequency analysis: via STFT or wavelets for non-stationary signals, balancing
time–frequency resolution
• Filtering algorithms:
– FIR/IIR filters, defined by difference equations of the form y[n] = Σ bₖ x[n–k] – Σ aₘ y[n–m],
used for smoothing, band filtering, etc.
– Adaptive filters / Wiener filters / Kalman filters for statistical noise suppression or
predictive modeling
• Feature extraction:
– E.g. MFCC in speech/audio, LPC coding, cepstral features for voice/speech tasks
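A brief sketch of the pipeline above with SciPy: a Butterworth (IIR) low-pass filter applied to a synthetic noisy signal, followed by an FFT to inspect the frequency content.
```python
# Signal processing sketch: noisy sine -> low-pass IIR filter -> FFT.
import numpy as np
from scipy import signal

fs = 500.0                                        # sampling rate (Hz)
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)   # 10 Hz tone + noise

# IIR (Butterworth) low-pass filter with a 25 Hz cutoff, applied zero-phase
b, a = signal.butter(4, 25, btype="low", fs=fs)
x_filtered = signal.filtfilt(b, a, x)

# Frequency-domain view of the filtered signal
spectrum = np.abs(np.fft.rfft(x_filtered))
freqs = np.fft.rfftfreq(x_filtered.size, d=1.0 / fs)
print("Dominant frequency (Hz):", freqs[np.argmax(spectrum)])
```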
GPS Data
• To collect GPS data, whether for mapping, tracking, surveying, or real-time
monitoring, here is a practical guide to collecting data from GPS devices and apps:
• Choose a Hardware or Software Setup
• a) Dedicated GPS Receiver
• Survey-Grade Receivers: Use RTK or DGPS to achieve centimeter‑level accuracy.
These require a base station or SBAS corrections (like India’s GAGAN) for error
mitigation
• Mid-Consumer GPS Loggers / Handhelds: Devices like Garmin eTrex or InReach
Mini can log tracks in GPX or internal storage; some run for days on AA batteries
and support interval-based logging
• b) Smartphone-Based Collection
• Field-mapping apps such as QField, Avenza, ArcGIS Field Maps, or Fulcrum let
you mark GPS points, draw tracks, and export in formats like CSV, KML, or
shapefiles
• Many apps support connection to external GNSS receivers for improved accuracy
• Logging Strategies :
• Time‑based logging: Store every N seconds (e.g. every 5 s
or 60 s).
• Distance‑based logging: Log when the device moves a
defined distance threshold.
• Dynamic (curve) logging: Retain only points where path
deviation exceeds error threshold for efficient, accurate
representation—used in fleet tracking systems like Geotab
• Angle or speed-based: Log when direction or velocity
changes significantly
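A rough sketch of the distance-based logging strategy above: a GPS fix is kept only if it lies more than a threshold distance from the last stored point (haversine distance; the coordinates and threshold are invented).
```python
# Distance-based GPS logging sketch: store a point only if it moved far enough.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    r = 6_371_000  # Earth radius (m)
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

THRESHOLD_M = 25                      # log only when the device moved >= 25 m
fixes = [(12.9716, 77.5946), (12.9717, 77.5946), (12.9740, 77.5950)]  # invented fixes

logged = [fixes[0]]
for lat, lon in fixes[1:]:
    if haversine_m(*logged[-1], lat, lon) >= THRESHOLD_M:
        logged.append((lat, lon))
print(logged)
```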
• File Formats & Data Output
• GPX (GPS Exchange Format): Common,
lightweight XML used across devices and apps.
• CSV / KML / Shapefiles / GeoPackage:
Standard GIS formats exportable by most field
apps
• RINEX: ASCII format storing raw satellite data
(pseudorange, carrier phase, Doppler) used for
post-processing high-precision surveying
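A minimal sketch of reading track points from a GPX file using only Python's standard library; the file name track.gpx is an assumption and the GPX 1.1 namespace is shown.
```python
# GPX parsing sketch: extract track points (lat, lon, time) with the stdlib.
import xml.etree.ElementTree as ET

NS = {"gpx": "http://www.topografix.com/GPX/1/1"}   # GPX 1.1 namespace

tree = ET.parse("track.gpx")                         # assumed file name
points = []
for trkpt in tree.getroot().findall(".//gpx:trkpt", NS):
    lat = float(trkpt.attrib["lat"])
    lon = float(trkpt.attrib["lon"])
    time_el = trkpt.find("gpx:time", NS)
    points.append((lat, lon, time_el.text if time_el is not None else None))

print(f"{len(points)} track points read")
```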
Method                             | Pros                                | Cons                                       | Typical Use Case
Dedicated GPS logger               | Long battery life, GPX export       | Limited accuracy (~5 m)                    | Travel tracking, hiking trails
Survey-grade receiver (RTK/DGPS)   | Centimeter-level accuracy           | Costly, needs corrections setup            | Surveying, mapping, construction
Smartphone + app (+ external GNSS) | Easy to use, field-friendly         | Accuracy depends on hardware & conditions  | Field data collection, surveys
Telemetry beacon + API             | Real-time tracking, remote logging  | Requires infrastructure or service         | Fleet, remote monitoring
Data Management
Area                   | Best Practice                                                           | Benefits
Naming & Organization  | Standard file/table naming (YYYYMMDD, timestamps, department prefixes) | Improves discoverability, collaboration
Metadata & Cataloging  | Use rich, searchable metadata for fields, owners, creation context     | Enables effective lineage and discoverability
Central Storage        | Unified data repository or DW with SSOT strategy                       | Consistency and integrated analytics
Version Control        | Dataset versioning, data pipelines in CI/CD workflows                  | Reproducibility, safe rollback
Data Quality Assurance | Automated validation, deduplication, normalization                     | High accuracy and trust in insights
Governance & Access    | Defined roles, RBAC, data policies, encryption, masking                | Security and regulatory compliance
Self-Service Tools     | Enable business users with governed analytics tools                    | Scales usage and democratizes insights
Monitoring & Audits    | Regular checks on quality, compliance, access logs                     | Early detection of issues, governance enforcement