Advanced LabVIEW Programming
Concepts for Multicore Systems
Agenda
Multithreading Overview
Parallel Programming Techniques
Multicore Programming Challenges
Application Examples
Conclusions
Multithreading Overview
Impact on Engineers and Scientists
Engineering and scientific applications typically run on
dedicated systems (i.e., little multitasking).
Creating Multithreaded Applications
Engineers and scientists must use threads to benefit
from multicore processors.
What are Processes and Threads?
Process
Every program executes in a process
Processes provide resources needed to execute
Each process has at least one thread
Thread
Entity within a process that can be executed
Shares resources of the process
Has individual thread resources
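The deck is LabVIEW-centric, but the process/thread relationship above can be sketched in any language. A minimal Python illustration (names are illustrative, not from the slides): each thread has its own identity and stack, yet all threads write into memory owned by the one process.

```python
import threading

results = {}  # a shared process resource, visible to all threads

def worker(name):
    # each thread has its own stack and identifier,
    # but writes into memory shared by the whole process
    results[name] = threading.get_ident()

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # both threads wrote into the shared dict
```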
How the OS Handles Thread Scheduling
Different OSs use different thread scheduling techniques
With a multicore CPU, the OS will simultaneously execute
different threads on different cores when possible
[Diagram: example of different thread states — several threads Waiting or Ready to Run, with one thread Running on Core 1 and another on Core 2]
How LabVIEW Implements Multithreading
Implicit Parallelism / Threading
Automatic multithreading using the LabVIEW execution system
Parallel code paths on block diagram can execute
in unique threads
Explicit Parallelism / Threading
Timed Structures spawn a single thread
Automatic Multithreading in LabVIEW
LabVIEW automatically divides each application into multiple
execution threads (originally introduced in 1998 with LabVIEW 5.0)
Original goal of multithreading was to more elegantly handle
hardware interrupts and allow for a responsive UI
Automatic Multithreading in LabVIEW
An oversimplification of threading in LabVIEW is
shown below
The main idea is that parallel code paths will execute in
unique threads
[Diagram: three parallel code paths on a block diagram, each executing in its own thread]
How the LabVIEW Execution System Works
1. The LabVIEW compiler analyzes the diagram and assigns code pieces to clumps
2. Information about which pieces of code can run together is stored in a run queue
3. If the block diagram contains enough parallelism, it will simultaneously execute in all system threads
Note: the number of threads scales based on the number of CPUs
Multithreading Features in LabVIEW 8.5
Scale # of execution system threads based on
available cores
Improved thread scheduling for LabVIEW timed
loops (to allow multicore support)
Processor Affinity capability with Timed Structures
Real-Time Features:
Support for Real-Time targets with Symmetric Multiprocessing
Real-Time Execution Trace Toolkit 2.0
Deterministic Real-Time Systems
LabVIEW 8.5 adds Symmetric Multiprocessing
(SMP) for real-time systems.
Assigning Tasks to Specific Cores
In LabVIEW 8.5, users can assign code to specific
processor cores using the LabVIEW Timed Loop.
Explicit Threading with Timed Structures
Code within timed
structures will
execute in
precisely 1 thread
(no more)
Can be assigned a
relative priority
Can be used to set
processor affinity
DEMOS
Multithreading in LabVIEW
Implicit Parallelism (Automatic
Multithreading in LabVIEW)
Explicit Parallelism (Control of threads
with Timed Loops)
Parallel Programming Techniques
Parallel Programming in LabVIEW vs. C
LabVIEW Advantages
LabVIEW Disadvantages
Inherent Parallelism
Automatic multithreading
LabVIEW is cross-platform
(Windows, Mac, Linux), no
need to learn different
threading APIs
Support for parallel libraries
such as Intel Math Kernel
Library (MKL)
Just like C, optimization
techniques must be applied to
take full advantage of
multicore; there is no silver
bullet
No support for compiler
optimizations with OpenMP
(but NI is working to add
similar capability)
Application Example: Control System for
Autonomous Vehicle
Chose LabVIEW for
multithreading approach
over C
LabVIEW running on two
quad-core HP servers
"The ability of LabVIEW to automatically multithread our application, in
addition to the optimizations we performed in the language itself,
drastically reduced our development time."
- Michael Fleming, President of Torc Technologies
Multicore Programming Goals
1. Increase code execution speed (# of FLOPS)
2. Maintain rate of execution but increase data throughput
3. Evenly balance tasks across available CPUs (fair distribution of
processing load)
4. Dedicate time-critical tasks to a single CPU (Real-time use-case)
What are the Trade-offs?
Parallel programming overhead is greater than sequential
Optimizing for speed can come at a cost of more memory
utilization
Example: Unbalanced Parallel Tasks
[Diagram: six tasks (A–F) distributed unevenly across four CPU cores over time, leaving some cores idle while others are overloaded]
Goal: Balanced Parallel Tasks
[Diagram: the same six tasks (A–F) distributed evenly across the four CPU cores so that all cores finish at about the same time]
Application Example: Real-Time
Control
Wind Tunnel Safety-of-Flight
system at NASA Ames
Research Center
Benchmark Results
Ran on PXI-8106 RT Controller
Time-critical loop was reduced from
43% CPU load to 30% CPU load on one
CPU of the PXI-8106 RT
A core is left over for processing of non-critical tasks
Image Source: http://windtunnels.arc.nasa.gov
Application Decomposition
The first step is to break down the program into
core components
Application
Tasks
Data
Data Flow
Choosing the Right Parallel Strategy
Task Paradigm
Best suited for programs with independent functionalities
that have minimal data dependencies
Data Paradigm
Best suited for data operations that are completely
independent
Data Flow Paradigm
Best suited for data that have dependencies and require
prior computation
Parallel Strategies
[Diagram: the application decomposes into three paradigms, each with its strategies]
Task Paradigm: Task Parallelization, Divide & Conquer
Data Paradigm: Geometric, Recursive
Data Flow Paradigm: Pipeline, Wave Front
Task Parallelism
Tasks
Code is comprised of logically independent blocks
of functionality
Divide & Conquer
A section of code that can be decomposed into
parallelized subsections, once completed results
are merged together
Task Parallelism
Not all code requires sequential execution
Isolate independent chunks of code and
mark them as tasks (e.g. Task A, Task B, Task C)
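The same idea in a text language: a minimal Python sketch of task parallelism, where three hypothetical independent tasks (the bodies are placeholders, not from the slides) are submitted to a thread pool and run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def task_a():
    return sum(range(100))    # stand-in for independent work

def task_b():
    return max(range(100))    # no data shared with the other tasks

def task_c():
    return len("multicore")

# each logically independent block becomes a task in the pool
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(t) for t in (task_a, task_b, task_c)]
    a, b, c = [f.result() for f in futures]

print(a, b, c)
```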
Divide & Conquer
Useful for problems that can be broken into
subsections
Recursive algorithms such as quick sort and
merge sort
Break the problem into manageable segments
according to your resources
Divide & Conquer
[Diagram: the problem is split into subproblems, the subproblems are split again and solved independently, and the subsolutions are merged step by step into the final solution]
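The split/solve/merge structure can be sketched with a merge sort, one of the recursive algorithms named above. This Python illustration solves the two halves concurrently in a thread pool; in CPython a process pool would be needed for true CPU parallelism, since threads share one interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def merge(left, right):
    # merge step: combine two sorted subsolutions
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(data):
    # sequential divide & conquer for the subproblems
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    return merge(merge_sort(data[:mid]), merge_sort(data[mid:]))

def parallel_merge_sort(data, pool):
    # split: the two halves are solved concurrently, then merged
    mid = len(data) // 2
    left, right = pool.map(merge_sort, [data[:mid], data[mid:]])
    return merge(left, right)

with ThreadPoolExecutor(max_workers=2) as pool:
    print(parallel_merge_sort([5, 3, 8, 1, 9, 2], pool))
```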
Data Parallelism
Geometric Decomposition
No dependencies in data
Could be completely parallelized if enough
resources were available
Recursive Structure
Similar to Divide & Conquer strategy. Data is
inherently recursive and can be split up into
parallelized subsets
Data Parallelism
You can speed up processor-intensive operations
on large data sets by segmenting the data.
[Diagram 1: the entire data set sent through signal processing on a single CPU core while the other cores sit idle]
[Diagram 2: the data set split into four segments, each processed by signal processing on its own CPU core, with the results combined at the end]
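A Python sketch of the segmenting idea: the data set is split into one chunk per core, the same (placeholder) processing step runs on each chunk concurrently, and the partial results are combined. The chunk size and the mean computation are illustrative, not from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

def process_segment(segment):
    return sum(segment) / len(segment)   # stand-in for FFT/filter work

data = list(range(1, 9))                 # the full data set
# segment the data: one chunk per (assumed) core
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(process_segment, chunks))

result = sum(partial) / len(partial)     # combine the partial results
print(partial, result)
```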
Application Example: High-Speed Control
Max Planck Institute (Munich, Germany)
Plasma control in a nuclear fusion tokamak with
LabVIEW on an 8-core system using the data parallelism
technique
"With LabVIEW, we obtained a 5X processing speed-up on an
octal-core processor machine over a single-core processor."
- Dr. Louis Giannone
Lead Project Researcher
Max Planck Institute
Data Flow Parallelism
Pipelining
Data must go through a series of computations
(think of an automobile assembly line)
Wave Front
Data has dependencies but can be computed if
prior elements are computed
Pipelining
Many applications involve sequential,
multistep algorithms
Applying pipelining can increase performance
[Diagram: iterations of an Acquire → Filter → Analyze → Log sequence executed back to back; without pipelining, the second iteration (t4–t7) cannot begin until the first (t0–t3) finishes]
Pipelining Strategy
[Diagram: the four stages Acquire, Filter, Analyze, and Log are each assigned to a CPU core; at t0 only the first core is acquiring, and at each subsequent time step a new iteration enters the pipeline while earlier iterations advance one stage, until all four cores are busy simultaneously]
Pipelining in LabVIEW
[Diagram: a sequential implementation compared with a pipelined implementation of the same code]
Note: Queues may also be used to pipeline data between different loops
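The note about queues between loops carries over directly to text languages. A minimal Python sketch of a three-stage pipeline (the stage bodies are placeholders): each stage runs in its own thread, and bounded queues hand data from stage to stage.

```python
import queue
import threading

q1, q2 = queue.Queue(), queue.Queue()
logged = []

def acquire():
    for sample in range(5):      # stand-in for data acquisition
        q1.put(sample)
    q1.put(None)                 # sentinel: no more data

def filt():
    while (item := q1.get()) is not None:
        q2.put(item * 2)         # stand-in for a filtering step
    q2.put(None)                 # pass the sentinel downstream

def log():
    while (item := q2.get()) is not None:
        logged.append(item)      # stand-in for logging to disk

stages = [threading.Thread(target=f) for f in (acquire, filt, log)]
for s in stages:
    s.start()
for s in stages:
    s.join()
print(logged)
```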
Key Considerations for Pipelining
Consider # of Processor Cores when determining # of
pipeline stages
Be sure to balance stages since longest stage will be
limiting factor in performance speed-up
Example:
[Diagram: an unbalanced pipeline in which one stage takes much longer than the others and limits the overall throughput]
Application Example: Communications Test
AmFax Ltd. (United Kingdom)
Created wireless test systems for next-generation
phones using a LabVIEW architecture based
on the pipelining technique
"With LabVIEW and the dual-core embedded controller we are
achieving up to 5x speed savings."
- Mark Jewell
BDM Wireless
AmFax Ltd.
Tips for Balancing Pipeline Stages
Use LabVIEW Benchmarking techniques
Perform basic benchmarking with
timestamps and VI Profiler
Wave Front
Dependencies exist
in elements of the
data structure
For instance, the value
at (i,j) requires the
computed value at
(i-1,j-1)
Wave Front
As long as the dependencies are satisfied
multiple operations can be carried out on
data
Wave Front
The wave front effect appears as parallel
executions iterate over the data
[Diagram: computation sweeping across the data structure along anti-diagonals, like a wave front]
Practical Applications
Error Diffusion for Black & White Printers
Image Processing & Filtering
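A Python sketch of the wave front pattern, under illustrative assumptions: cell (i, j) depends on (i-1, j), (i, j-1), and (i-1, j-1), so all cells on one anti-diagonal are independent and can be computed concurrently once the previous diagonals are done. The grid values and update rule are placeholders, not from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

N = 4
grid = [[0] * N for _ in range(N)]
for k in range(N):                 # seed the first row and column
    grid[0][k] = grid[k][0] = k

def fill(cell):
    # placeholder update that reads the three dependency cells
    i, j = cell
    grid[i][j] = grid[i - 1][j] + grid[i][j - 1] - grid[i - 1][j - 1]

with ThreadPoolExecutor() as pool:
    # sweep anti-diagonals in order; cells within one diagonal
    # have no mutual dependencies and run concurrently
    for d in range(2, 2 * N - 1):
        cells = [(i, d - i) for i in range(1, N) if 1 <= d - i < N]
        list(pool.map(fill, cells))

print(grid[N - 1][N - 1])
```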
Predicting Speed Increase: Amdahl's Law
Theory used to calculate the expected speed-up of a
parallelized implementation (relative to serial):
Stotal = 1 / (P1/S1 + P2/S2 + ... + PN/SN)
Pk = % of instructions affected
Sk = Speed increase factor
k = Section of code label
N = Total number of sections
Stotal = Total speed change factor
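With the variables above, the generalized form is Stotal = 1 / Σ(Pk / Sk) summed over the N sections, where the Pk fractions add up to 1. A small Python sketch of the calculation:

```python
def amdahl_speedup(sections):
    """sections: list of (P_k, S_k) pairs; the P_k values must sum to 1.

    Returns the total expected speed-up relative to serial execution.
    """
    return 1.0 / sum(p / s for p, s in sections)

# e.g. half the program parallelized 4x, the rest left serial:
# 1 / (0.5/4 + 0.5/1) = 1 / 0.625 = 1.6
print(amdahl_speedup([(0.5, 4.0), (0.5, 1.0)]))  # 1.6
```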
DEMOS
Parallel Programming Techniques
Divide and Conquer/Recursion
Data Parallelism
Pipelining
Amdahl's Law
Multicore Programming
Challenges
Multicore Programming Challenges
Programming Caveats
Thread Synchronization
Race Conditions
Deadlocks
Potential Code Bottlenecks
Debugging
More things happening at one time
Memory
Data transfer between processor cores and cache considerations
Thread Synchronization
With OS scheduling, there is no guarantee when
threads will execute without using synchronization
primitives
Order of events may change at each execution due
to the way the threads are scheduled
First execution: Thread 1 → Thread 2 → Thread 3
Second execution: Thread 2 → Thread 3 → Thread 1
Third execution: Thread 3 → Thread 1 → Thread 2
Race Conditions
This issue occurs when threads manipulate shared resources
simultaneously
Common problem when code is migrated from single CPU
system to multicore (software was not originally created for
multicore)
Without synchronization between threads, anomalous
behavior results
Ex: Two threads simultaneously writing to one memory
location
[Diagram: Thread 1 data and Thread 2 data colliding at the same memory location]
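A Python sketch of the shared-memory scenario above: two threads repeatedly update one counter. The read-modify-write in `counter += 1` is not atomic, so without a lock updates can be lost; guarding the update with a `Lock` serializes access and makes the result deterministic. The counts are illustrative.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # remove this guard to risk lost updates
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # 20000 with the lock in place
```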
Deadlocks and Thread Starvation
Deadlock occurs when two or more threads wait for
resources occupied by another thread in the same
group
Thread deadlock is
analogous to traffic
deadlock: nothing can
proceed
Thread Starvation occurs when a thread is
perpetually denied the resources it needs to execute
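One common prevention technique (not from the slides) is a consistent global lock order: if every thread acquires shared locks in the same order, a circular wait cannot form. A minimal Python sketch with two illustrative locks:

```python
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()
completed = []

def worker(name):
    # both threads acquire lock_a before lock_b; a circular wait
    # (one thread holds A wanting B while the other holds B
    # wanting A) can therefore never form
    with lock_a:
        with lock_b:
            completed.append(name)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(completed))
```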
Addressing Multicore Challenges with
LabVIEW
Dataflow paradigm is a key
benefit for multicore
programming
Helps synchronize threads
and prevent race
conditions/deadlocks
thread
thread
Parallel code paths are
synchronized and order of
execution is determined by
LabVIEW wires
Synchronization in LabVIEW
When more synchronization
is required, use
synchronization mechanisms:
Notifiers
Queues
Semaphores
Rendezvous
Occurrences
Potential Bottlenecks: Shared Resources
1. Data Dependencies
Example: Data stored in global and shared
variables that need to be accessed by different VIs
would be a shared resource
2. Hard Disk
Example: Computers can only read or write to
the hard disk one item at a time (File I/O cannot
be made into parallel operations)
3. Blocking Items
Example: Non-reentrant VIs
4. Entire software stack must support multicore
Non-reentrant VIs
VIs that are non-reentrant cannot be called simultaneously;
one call runs and the other call waits for the first to finish
before running
To make a VI reentrant, select File»VI Properties, select the Execution
category, and choose Reentrant execution.
Multithreaded Software Stack Support
Software Stack
Development tool: support provided on the operating system of choice; the tool facilitates correct threading and optimization (Example: the multithreaded nature of LabVIEW and structures that provide optimization)
Libraries: thread-safe, re-entrant libraries (Example: BLAS libraries)
Device drivers: drivers designed for optimal multithreaded performance (Example: the NI-DAQmx driver software)
Operating system: supports multithreading and multitasking and can load-balance tasks (Example: Windows, Mac OS, Linux, and real-time operating systems)
Application Example: Eaton Corporation
Eaton created a portable in-vehicle test system for
truck transmissions using LabVIEW.
Acquired and analyzed 16 channels on a single core using the
multithreaded NI-DAQmx driver
Now acquires and analyzes 80+ channels on a multicore system
"There was no need to rewrite our
application for the new multicore
processing platforms."
- Scott Sirrine
Lead Design Engineer
Eaton Truck Division
Debugging Methods
Functional Debugging
Trace Debugging
Performance Counters
Functional Debugging
LabVIEW supports debugging parallel code for
functional correctness
Use basic LabVIEW debugging tools (highlight
execution, probes, etc.) to ensure code is
functionally correct
Trace Debugging
On real-time systems, trace debugging can show thread
activity at the OS level
Thread activity on each core is displayed by selecting a
particular CPU
Performance Counters
Performance counters provide
detailed system information such as
CPU usage, memory usage, and cache
hits/misses
LabVIEW does not natively support
performance counters but can call
Windows counters programmatically
Example utilities for performance
counting include:
Windows Perfmon
Intel's VTune
DEMOS
Debugging Methods
Functional Debugging
Trace Debugging
Performance Counters
Memory Considerations
Data transfer between cores
Cache considerations
Data Transfer Between Cores
Physical distance between processor and the
quality of the processor connections can have
large effect on execution speed
Cache Considerations
Multicore processors
typically utilize a shared
cache
A common cache problem is
false sharing, where two
cores write to the same
cache line, causing
performance degradation
For cache optimization, use
processor affinity
Cache Optimization with Processor Affinity
Setting processor affinity tells the OS which processor to
execute the code on
Processor affinity can prevent OS from scheduling threads in a
configuration that hurts cache usage
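In LabVIEW this is configured through the Timed Loop, as described earlier. At the OS level the same idea can be sketched in Python via `os.sched_setaffinity`, a Linux-only API (the call does not exist on Windows or macOS); pinning a process to one core helps keep its working set in that core's cache.

```python
import os

original = os.sched_getaffinity(0)      # cores this process may run on
first_core = min(original)

# pin this process to a single core (processor affinity)
os.sched_setaffinity(0, {first_core})
assert os.sched_getaffinity(0) == {first_core}

# restore the original affinity mask so the OS may schedule freely
os.sched_setaffinity(0, original)
print(sorted(original))
```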
Application Examples
Example #1 Multi-channel Acquisition
and Signal Processing
Overview: Two channels are read from a digitizer and an
FFT operation is performed on the data.
[Diagram: two channels acquired from a digitizer, each followed by its own FFT]
Recommendation
1. Read data channels separately
2. Perform FFT operations in parallel
Example #2 Operations on Large Data Sets
Overview: Multiplication of two matrices puts heavy load on
processor, especially when the matrices are large
[Diagram: Matrix 1 and Matrix 2 fed into a Multiply Matrices operation producing the Result]
Recommendation
Split data into subsets and then perform the operation.
Example #3: Multi-loop Producer /
Consumer Architecture with Queues
Multiloop architectures use queues to pass data between
parallel loops
Queues allow each
loop to run at an
optimal rate
In this example, the
acquisition is not
slowed by the write to
disk task
Recommendation
Balance acquisition rate and processing rate for
maximum throughput
[Block diagram: a producer loop (Acquire from Scope » Enqueue Element) feeding a consumer loop (Dequeue Element » data decomposition, 7th-order low-pass filter, digital output)]
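The multiloop producer/consumer architecture maps directly onto threads and a queue in a text language. A minimal Python sketch (the processing step is a placeholder): the producer "acquires" at its own rate, the bounded queue decouples the two loops and applies backpressure, and the consumer processes at its own rate.

```python
import queue
import threading

q = queue.Queue(maxsize=8)       # bounded queue decouples the two loops
processed = []

def producer():
    # stands in for the "Acquire from Scope" loop
    for sample in range(10):
        q.put(sample)
    q.put(None)                  # sentinel ends the stream

def consumer():
    # stands in for the filtering/logging loop
    while (item := q.get()) is not None:
        processed.append(item * item)   # placeholder processing

threads = [threading.Thread(target=f) for f in (producer, consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(processed)
```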
DEMO
Multiloop Producer / Consumer
Conclusions
Multithreading - LabVIEW offers implicit parallelism
(automatic multithreading) and explicit parallelism
(timed structures)
Parallel Programming Techniques - There is no silver
bullet; the advantage of LabVIEW is that parallelism is
much more easily expressed in the language
Conclusions (continued)
Multicore Programming Challenges - Debugging and
memory considerations have evolved with multicore
and play an important role
Application Examples - With minor modifications,
typical LabVIEW applications can be optimized for
multicore
Resources
www.ni.com/multicore
How to Develop
your LabVIEW skills
Fast-Track to Skill Development
Core Courses (begin here): LabVIEW Basics I » LabVIEW Basics II » LabVIEW Intermediate I » LabVIEW Intermediate II » LabVIEW Advanced I
Certifications: Certified LabVIEW Associate Developer Exam » Certified LabVIEW Developer Exam » Certified LabVIEW Architect Exam
Experience levels: New User » Experienced User » Advanced User
If you are unsure:
- Quick LabVIEW quiz
- Fundamentals exam
ni.com/training
Training Membership: the Flexible Option
Ideal for developing your skills with NI
products
12 months to attend any NI regional or online
course and take certification exams
$4,999 for a 12-month membership in the USA
Next Steps
Visit ni.com/training
Identify your current expertise level and
desired level
Register for appropriate courses
$200 discount for attending LV Dev Day event!