User Manual
Manual for
CLC Main Workbench 25.0.3
Windows, macOS and Linux
I Introduction
II Core Functionalities
2 User interface
2.1 View Area
2.2 Zoom functionality in the View Area
2.3 Toolbox panel
2.4 Processes tab and Status bar
2.5 History and Element Info views
2.6 Workspace
2.7 List of shortcuts
5 Printing
5.1 Selecting which part of the view to print
5.2 Page setup
5.3 Print preview
12 Metadata
12.1 Creating metadata tables
12.2 Associating data elements with metadata
12.3 Working with data and metadata
12.4 Moving, copying and exporting metadata
13 Workflows
13.1 Creating and editing workflows
13.2 Workflow elements
13.3 Launching workflows individually and in batches
13.4 Advanced workflow batching
13.5 Template workflows
13.6 Managing workflows
IV Appendix
Bibliography
Part I
Introduction
Chapter 1
Contents
1.1 Contact information and citation
1.2 Download and installation
1.2.1 General information about installing and upgrading Workbenches
1.2.2 Installation on Microsoft Windows
1.2.3 Installation on macOS
1.2.4 Installation on Linux with an installer
1.3 System requirements
1.3.1 Limitations on maximum number of cores
1.4 Workbench Licenses
1.4.1 Request an evaluation license
1.4.2 Download a license using a license order ID
1.4.3 Import a license from a file
1.4.4 Upgrade license
1.4.5 Configure license manager connection
1.4.6 Viewing or updating license information
1.4.7 Download a static license on a non-networked machine
1.4.8 Viewing mode
1.4.9 Start in safe mode
1.5 Plugins
1.5.1 Install
1.5.2 Uninstall
1.5.3 Updating plugins
1.6 Network configuration
Welcome to CLC Main Workbench 25.0.3, a software package supporting your daily bioinformatics
work.
The CLC Main Workbench provides an easy-to-use graphical interface for running bioinformatics
analyses. Tools can be run individually, or chained together in a workflow, making running
complex analyses simple and efficient. The functionality of the CLC Main Workbench can also be
extended using plugins. The built-in Plugin Manager provides an up-to-date listing. A list is also
CHAPTER 1. INTRODUCTION TO CLC MAIN WORKBENCH 13
• The built-in Workbench user manual can be opened by choosing the Help option or by
pressing the F1 key.
• Manuals for installed plugins can be accessed under the Plugin Help option.
• The Online Tutorials option opens our tutorials webpage in a browser. Tutorials offer
hands-on examples of how to use features of the CLC Main Workbench. Alternatively, click
on the following link to visit that webpage:
https://digitalinsights.qiagen.com/support/tutorials/.
Watch product specialists demonstrate our software in the videos offered via our Online
presentations area: https://tv.qiagenbioinformatics.com/.
The latest version of this user manual can be found in pdf and html formats at
https://digitalinsights.qiagen.com/technical-support/manuals/
The CLC Main Workbench is being constantly developed and improved. A detailed list of new
features, improvements, bug fixes, and changes for the current version of CLC Main Workbench can
be found at https://digitalinsights.qiagen.com/technical-support/latest-improvements/.
Disclaimer: CLC software is intended for scientific research applications. CLC software is not
intended for the diagnosis, prevention or treatment of a disease.
The QIAGEN Aarhus team is continuously improving CLC Main Workbench with your interests
in mind. We welcome all requests and feedback from users, as well as suggestions for new
features or more general improvements to the program.
Getting help via the Workbench If you encounter a problem or need help understanding
how CLC Main Workbench works, and the license you are using is covered by our Maintenance,
Upgrades and Support (MUS) program (https://digitalinsights.qiagen.com/
Figure 1.1: Contact our Support team by clicking on the button at the right hand side of the top
Toolbar
This will open a dialog where you can enter your contact information, and a text field for writing
the question or problem you have. On a second dialog you will be given the chance to attach
screenshots or even small datasets that can help explain or troubleshoot the problem. When you
send a support request this way, it will automatically include helpful technical information about
your installation and your license information so that you do not have to look this up yourself.
Our support staff will reply to you by email.
Other ways to contact the support team You can also contact the support team by email:
ts-bioinformatics@qiagen.com
Please provide your contact information, your license information, some technical information
about your installation, and describe the question or problem you have. You can also attach
screenshots or even small data sets that can help explain or troubleshoot the problem.
Information about the license(s) being used by a CLC Workbench and any installed modules can
be found by opening the License Manager:
Help | License Manager...
Information about MUS cover on particular licenses is provided in your myCLC account:
https://secure.clcbio.com/myclc/login.
How to cite us To cite a CLC Workbench or Server product, use the name of the product and
the version number. For example, QIAGEN CLC Main Workbench 24.0 or QIAGEN CLC Genomics
Workbench 24.0. If a location is required by the publisher of the publication, use (QIAGEN,
Aarhus, Denmark). Our website is https://digitalinsights.qiagen.com/.
Further details about citing QIAGEN Digital Insights software can be found in our FAQ at
https://qiagen.my.salesforce-sites.com/KnowledgeBase/KnowledgeNavigatorPage?id=kA41i000000L63hCAC
More ways to get installers are described in the Frequently Asked Question entry "Where can I get
installer files for QIAGEN CLC software?":
https://qiagen.my.salesforce-sites.com/KnowledgeBase/KnowledgeNavigatorPage?id=kA41i000000L5uQCAS
To check for available updates from within the software, go to the menu option: Help | Check for
Updates...
General information about running software installers, including differences between upgrading to
a new minor version compared to upgrading to a new major version, is covered in section 1.2.1.
Detailed instructions for running the software installer in interactive mode on each supported
operating system then follow.
Information about running the software installers in console mode and silent mode is provided
in the Workbench Deployment manual at
https://resources.qiagenbioinformatics.com/manuals/workbenchdeployment//current/index.php?manual=Installation_modes_console_silent.html.
1. Extracts and copies files to the installation directory The Workbench software is installed
into a directory. It is self-contained. The suggested folder name to install into reflects the
software name and the major version line. For example, for a CLC Genomics Workbench
with major version 25, the default installation location offered on each platform would be:
To install the software into central locations, like those listed above, generally requires
administrator rights. Administrator rights will also be needed to install licenses and plugins
for installations in central locations. The software can be installed to another location, if
desired. When only a single person will use the software, this can be useful. Installing
to an area they have permission to write to means that licenses and plugins can then be
installed without needing administrator rights.
General recommendations for installation locations
• For minor updates, you will be asked whether you wish to:
Update the existing installation Generally recommended for minor updates. New
files will be installed into the same directory as the existing installation. Licensing
information and installed plugins remain in place from the installation already
present.
OR
Install to a different directory. Configuration will be needed after installation. E.g.
licensing needs to be configured, any desired plugins will need to be installed,
etc.
• For major updates, the suggested installation directory will reflect the new major
version number of the software. Please do not install a new major version into the same
folder as an existing, older version of the Workbench. Configuration will be needed after
installation. E.g. licensing needs to be configured, any desired plugins will need to be
installed, etc.
2. Sets the amount of memory The installer investigates the amount of RAM on the machine
during installation and sets the amount of memory that the Workbench can use.
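The distinction between minor and major updates described above can be sketched in a few lines of Python. This is only an illustration, not part of the CLC software. It assumes that the first dot-separated number of a version string (for example, the 25 in 25.0.3) identifies the major version line, and the function names `parse_version` and `classify_update` are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Split a version string such as '25.0.3' into integers (25, 0, 3)."""
    return tuple(int(part) for part in version.split("."))


def classify_update(installed: str, candidate: str) -> str:
    """Classify a candidate version as a 'major' update, a 'minor' update or 'none'.

    Assumption: the first number in the version string is the major version line.
    """
    old, new = parse_version(installed), parse_version(candidate)
    if new[0] > old[0]:
        # New major version: install into a new directory, then reconfigure
        # licensing and plugins.
        return "major"
    if new > old:
        # Same major version line: updating the existing installation is
        # generally recommended.
        return "minor"
    # The candidate is not newer than the installed version.
    return "none"
```

For instance, `classify_update("25.0.1", "25.0.3")` falls into the minor case, where licensing information and installed plugins remain in place if the existing installation is updated.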
On Macs without Rosetta present on the system, the option of installing it is offered during the
installation process. Rosetta enables Intel-based features to run on Apple Silicon Macs. While
not needed for the majority of tools, some require it, for example De Novo Assembly, BLAST,
Sample Reads and tools for analyzing small RNA.
Updating workflows after upgrading a CLC Workbench is described in section 13.6.1.
• Unless you are installing a minor update to the same folder as an existing installation,
you will be prompted to choose where you would like to install the Workbench. If you
are upgrading from an earlier version, please refer to section 1.2.1 for information about
installing to an existing or different directory. Click on Next.
• Choose where you want the program's shortcuts to be placed. Click on Next.
• Choose if you would like to associate .clc files with the CLC Main Workbench. If you check
this option, double-clicking a file with a "clc" extension will open the CLC Main Workbench.
• Choose if a desktop icon should be created, and choose whether clc:// URLs should be
opened by this program by default. Click on Next.
• Wait for the installation process to complete, and then choose whether you would like to
launch CLC Main Workbench right away. Click on Finish.
When the installation is complete the program can be launched from the Start Menu or from one
of the shortcuts you chose to create.
• Choose where you would like to install the application. If you are upgrading from an earlier
version, please refer to section 1.2.1 for information about installing to an existing or
different directory. Click on Next.
• Specify other options associated with the installation such as whether a desktop icon
should be created, whether the software should open clc:// URLs, whether .clc files should
be associated with the software and whether it should be added to the dock. Click on Next.
• Wait for the installation process to complete, choose whether you would like to launch CLC
Main Workbench right away, and click on Finish.
On Apple Silicon Macs without Rosetta present on the system, the option of installing it is offered
during the installation process. Rosetta enables Intel-based features to run on Apple Silicon
Macs. While not needed for the majority of tools, some require it, for example De Novo Assembly,
BLAST, Sample Reads and tools for analyzing small RNA.
When the installation is complete, the program can be launched from the dock, if present there,
or by clicking on the desktop shortcut if you chose to create one. The software can also be
launched from within the installation folder.
# sh CLCMainWorkbench_25_0_3_64.sh
To install to a central location such as /opt or /usr/local, you will normally need to run the above
command using sudo. If you do not have sudo privileges you can choose to install in your home
directory, or any other location you have write permission for.
Then walk through the following steps. (The exact order in which options are presented may
differ from that described.)
• Choose where you would like to install the application. If you are upgrading from an earlier
version, please refer to section 1.2.1 for information about installing to an existing or
different directory. Click on Next.
• Choose where you would like to create symbolic links to the program. Click on Next.
DO NOT create symbolic links in the same location as the application.
Symbolic links should be installed in a location which is included in your environment PATH.
For a system-wide installation you can choose for example /usr/local/bin. If you do not
have root privileges you can create a 'bin' directory in your home directory and install
symbolic links there. You can also choose not to create symbolic links.
If you choose to create symbolic links in a location which is included in your PATH, the program
can be executed by running the command:
# clcmainwb25
Otherwise, start the application by navigating to the location where you chose to install it
and running the command:
# ./clcmainwb25
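Whether a directory chosen for symbolic links is already included in PATH can be checked before running the installer. The following is a minimal Python sketch, not part of the CLC software. The helper name `is_on_path` is an assumption made for illustration, and the comparison is purely textual after path normalization (it does not resolve symbolic links).

```python
import os
from typing import Optional


def is_on_path(directory: str, path_value: Optional[str] = None) -> bool:
    """Return True if `directory` is one of the entries in PATH.

    If `path_value` is not given, the PATH of the current environment is used.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Normalize so that '~/bin' and trailing slashes compare as expected.
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries
```

For example, `is_on_path("~/bin")` reports whether a personal 'bin' directory in your home directory is already in PATH. If it is not, symbolic links created there will not make the launch command available without giving the full path.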
• Windows: Supported versions of Windows 10, Windows 11, Windows Server 2016, Windows
Server 2019, Windows Server 2022 and Windows Server 2025
• Linux: RHEL 8 and later and supported versions of SUSE Linux Enterprise Server 12.5 and
later. The software is expected to run without problem on other recent Linux systems, but
we do not guarantee this. To use BLAST related functionality, libnsl.so.1 is required.
• 1 GB RAM required
• 2 GB RAM recommended
The options available in the License Assistant window are described in brief below, and then in
detail in the sections that follow.
• Download a license Use the license order ID provided when you purchase the software to
download and install a static license file.
• Import a license from a file Import an existing static license file, for example a file
downloaded from the license download webpage.
• Upgrade from an existing Workbench installation If you have used a previous version of
the CLC Main Workbench, and you are entitled to upgrade to a new major version, select
this option to upgrade your static license file.
• Configure license manager connection If your organization has a CLC Network License
Manager, select this option to configure the connection to it.
Select the appropriate option and then click on the Next button.
To use the Request an evaluation license, Download a license or the Upgrade from an existing
Workbench installation options, your machine must be able to access the external network. If
this is not the case, please see section 1.4.7.
When using a CLC Main Workbench installed in a central location on your system, you must be
running the program in administrative mode to license the software. On Linux and Mac, this
means you must be logged in as an administrator. On Windows, you can right-click the program
shortcut and choose "Run as Administrator".
If you do not have a license order ID or access to a license, you can still use the Workbench in
Viewing Mode. See section 1.4.8 for further information about this.
Note: Static licenses are tied to the host ID of the machine they were downloaded to. If your
license is covered by Maintenance, Upgrades and Support (MUS), please contact our Support
team (ts-bioinformatics@qiagen.com) if you need to start using a different machine for working
with the CLC Main Workbench.
• Direct Download. Download the license directly. This method requires that the Workbench
has access to the external network.
• Go to CLC License Download web page. The online license download form will be opened
in a web browser. This option is suitable for when downloading a license for use on another
machine that does not have access to the external network, and thus cannot access the
QIAGEN Aarhus servers.
After selecting your method of choice, click on the button labeled Next.
Figure 1.3: Choose between downloading a license directly, or opening the license download form
in a web browser.
Direct download
After choosing the Direct Download option and clicking on the button labeled Next, a dialog
similar to that shown in figure 1.4 will appear if the license is successfully downloaded and
installed.
Figure 1.4: A license has been successfully downloaded and installed for use.
When the license has been downloaded and installed, the Next button will be enabled.
If there is a problem, a dialog will appear indicating this.
Figure 1.6: Importing the license file downloaded from the web page.
Figure 1.7: Enter a license order ID into the text field and then click on the Next button.
• Direct Download. Download the license directly. This method requires that the Workbench
has access to the external network.
• Go to CLC License Download web page. The online license download form will be opened
in a web browser. This option is suitable for when downloading a license for use on another
machine that does not have access to the external network, and thus cannot access the
QIAGEN Aarhus servers.
After selecting your method of choice, click on the button labeled Next.
Direct download
After choosing the Direct Download option and clicking on the button labeled Next, a dialog
similar to that shown in figure 1.8 will appear if the license is successfully downloaded and
installed.
Figure 1.8: A license has been successfully downloaded and installed for use.
When the license has been downloaded and installed, the Next button will be enabled.
If there is a problem, a dialog will appear indicating this.
Click on the Download License button and then save the license file.
Back in the Workbench window, you will now see the dialog shown in figure 1.10.
Figure 1.10: Importing the license file downloaded from the web page.
Click on the Choose License File button, find the saved license file and select it. Then click on
the Next button.
Click on the Choose License File button, locate the license file and select it. Then click on the
Next button.
text I accept these terms. If further information is requested from you, please fill this in before
clicking on the Finish button.
When you click on the Next button, the Workbench checks if you are entitled to upgrade your
license. This is done by contacting QIAGEN Aarhus servers.
If the earlier Workbench version could not be found, which can be the case if you have installed
to a custom location or are upgrading from one Workbench product to another product replacing
it (see footnote 1), then click on the "Choose a different License File" button. Navigate to where
the older license file is, which will be in a subfolder called "licenses" within the installation
area of the Workbench you are upgrading from. Select the license file and click on the "Open" button.
Footnote 1: In November 2018, the Biomedical Genomics Workbench was replaced by the CLC Genomics
Workbench and a free plugin, Biomedical Genomics Analysis. Licenses for the Biomedical Genomics
Workbench covered by MUS at that time can be used to download a valid license for the CLC Genomics
Workbench, but the upgrade functionality is not able to automatically find the older license file.
If the license selected can be updated, a message similar to that shown in figure 1.13 will be
displayed. If there is a problem updating the selected license, a dialog will appear indicating this.
Click on the Next button and then choose how to proceed to get the updated license file.
In this dialog, there are two options:
• Direct Download. Download the license directly. This method requires that the Workbench
has access to the external network.
• Go to CLC License Download web page. The online license download form will be opened
in a web browser. This option is suitable for when downloading a license for use on another
machine that does not have access to the external network, and thus cannot access the
QIAGEN Aarhus servers.
After selecting your method of choice, click on the button labeled Next.
Direct download
After choosing the Direct Download option and clicking on the button labeled Next, a dialog
similar to that shown in figure 1.14 will appear if the license is successfully downloaded and
installed.
When the license has been downloaded and installed, the Next button will be enabled.
If there is a problem, a dialog will appear indicating this.
Figure 1.14: A license has been successfully downloaded and installed for use.
Back in the Workbench window, you will now see the dialog shown in figure 1.16.
Figure 1.16: Importing the license file downloaded from the web page.
Click on the Choose License File button, find the saved license file and select it. Then click on
the Next button.
• Enable license manager connection. This box must be checked for the Workbench to
contact the CLC Network License Manager to get a license for the CLC Main Workbench.
• Automatically detect license manager. By checking this option the Workbench will look
for a CLC Network License Manager accessible from the Workbench. Automatic server
discovery sends UDP broadcasts from the Workbench on port 6200. Available license
servers respond to the broadcast. The Workbench then uses TCP communication to get
a license, if one is available. Automatic server discovery works only on local networks and
will not work on WAN or VPN connections. Automatic server discovery is not guaranteed to
work on all networks. If you are working on an enterprise network where local firewalls
or routers cut off UDP broadcast traffic, then you may need to configure the details of the
CLC Network License Manager using the Manually specify license manager option instead.
• Manually specify license manager. Select this option to enter the details of the machine
the CLC Network License Manager is running on, specifically:
    Host name. The address of the machine the CLC Network License Manager is running on.
    Port. The port used by the CLC Network License Manager to receive requests.
• Use custom username when requesting a license. Optional. When unchecked (the default),
the username of the account being used to run the Workbench is the username used when
contacting the license manager. When this option is checked, a different username can
be entered for that purpose. Note that borrowing licenses is not supported with custom
usernames.
• Disable license borrowing on this computer. Check this box if you do not want users of
this Workbench to borrow a license. See section 1.4.5 for further details.
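When the license manager details are specified manually, it can help to confirm that the host and port are reachable from the machine running the Workbench. The sketch below is a generic TCP connectivity check written in Python. It is not a CLC tool, it does not request a license, and it says nothing about UDP-based automatic discovery. The function name `can_connect` is hypothetical, and the host name and port must be the values provided by your network license administrator.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    This only tests basic reachability. A False result can mean that the
    service is not running, that the details are wrong, or that a firewall
    blocks the connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as `can_connect("license-host.example.org", 12345)` uses placeholder values only. If the check fails for the real host name and port, the cause may be that the CLC Network License Manager is not running or that network details have changed.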
Borrowing a license
A CLC Main Workbench using a network license normally needs to maintain a connection to the
CLC Network License Manager. However, if allowed by the network license administrator, network
licenses can be borrowed for offline use for a period of time. While the license is borrowed, there
is one less network license available for other users. Borrowed licenses can be returned early.
The Workbench must be connected to the CLC Network License Manager at the point when the
license is borrowed or returned. The procedure for borrowing a license is:
2. Click on the "Borrow License" tab to display the license borrowing settings (figure 1.18).
3. Select the license(s) that you wish to borrow by clicking in the checkboxes in the Borrow
column in the License overview panel.
If you plan to borrow module licenses but they are not listed, start a job that requires that
module. This will check out the relevant module license, so that it becomes available to
borrow.
4. Choose the length of time you wish to borrow the license(s) for using the drop down
list in the Borrow License tab. By default the maximum is 7 days, but network license
administrators can specify a lower limit than this.
You can now go offline and continue working with the CLC Main Workbench. When the time period
you borrowed the license for has elapsed, the network license will be again made available for
other users. To continue using CLC Main Workbench with a license, you will need to connect to
the network again so the Workbench can request another license.
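The borrowing period is simple date arithmetic. As an illustration only, the following Python sketch computes when a borrowed license would lapse, assuming the default 7-day maximum mentioned above. The function name `borrow_expiry` and the choice to reject longer periods with an error are assumptions for this example, not a description of how the Workbench implements borrowing.

```python
from datetime import datetime, timedelta

# Default maximum stated in this manual; administrators may set a lower limit.
DEFAULT_MAX_BORROW_DAYS = 7


def borrow_expiry(borrowed_at: datetime, days: int,
                  max_days: int = DEFAULT_MAX_BORROW_DAYS) -> datetime:
    """Return the time at which a license borrowed for `days` days lapses."""
    if not 1 <= days <= max_days:
        raise ValueError(f"Borrow period must be between 1 and {max_days} days")
    return borrowed_at + timedelta(days=days)
```

Under these assumptions, a license borrowed for 3 days at 09:00 on a Monday would need to be renewed, by reconnecting to the network, by 09:00 on the following Thursday.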
You can return borrowed licenses early by opening the License Manager, going to the "Borrow
License" tab, and clicking on the Return Borrowed Licenses button.
Figure 1.19: When there are no available network licenses for the software, a message appears to
indicate this.
After at least one license is returned to the pool, you will be able to run the software and
get the necessary license. If running out of licenses is a frequent issue, you may wish to
discuss this with your administrator.
Data can be viewed, imported and exported, and very basic analyses launched, by running
the Workbench in Viewing Mode. Click on the Viewing Mode button in that dialog to launch
the Workbench in this mode.
Figure 1.20: This Workbench was unable to establish a connection to obtain a network license.
If you have chosen the option to Automatically detect license manager and you have not
succeeded in connecting to the CLC Network License Manager before, please check with
your local IT support that automatic detection is possible at your site. If it is not, you will
need to specify the settings, as described earlier in this section.
If you have successfully contacted the CLC Network License Manager from your Workbench
previously, please contact your local administrator. Common issues include that the CLC
Network License Manager is not running or that network details have changed.
Figure 1.21: License information and license-related functionality are available in the Workbench
License Manager.
• See information about its license (e.g. the license type, when it expires, etc.)
• Configure the connection to a CLC Network License Manager. Click on the Configure
Network License button at the lower left corner to open the dialog seen in figure 1.17.
• Upgrade from an evaluation license. Click on the Upgrade Workbench License button to
open the dialog shown in figure 1.2.
• Borrow a license from a CLC Network License Manager when network licenses are in use.
If you wish to switch away from using a network license, click on the button to Configure Network
License and uncheck the box beside the text Enable license manager connection in the dialog.
When you restart the Workbench, you can set up the new license as described in section 1.4.
• Install the CLC Main Workbench on the machine you wish to run the software on.
• Start up the software as an administrative user and find the host ID of the machine that
you will run the CLC Workbench on. You can see the host ID of the machine at the bottom
of the License Assistant window in grey text, or, if working in Viewing Mode, by launching
the License Manager from under the Workbench Help menu option.
• Make a copy of this host ID such that you can use it on a machine that has internet access.
• Go to a computer with internet access, open a browser window and go to the network
license download web page:
https://secure.clcbio.com/LmxWSv3/GetLicenseFile
• Paste in your license order ID and the host ID that you noted down in the relevant boxes on
the web page.
• Click on 'Download License' and save the resulting .lic file.
• Open the Workbench on your non-networked machine. In the Workbench license manager
choose 'Import a license from a file'. In the resulting dialog click on the 'Choose License
File' button and then locate and select the .lic file you have just downloaded.
If the License Manager does not start up by default, you can start it up by going to the
menu option:
Help | License Manager
• Click on the Next button and go through the remaining steps to install the license.
Data viewing
Any data type supported by the Workbench being used can be viewed in Viewing Mode. Plugins
or modules can also be installed when in Viewing Mode, expanding the range of data types
supported.
Viewing Mode of the CLC Workbenches can be particularly useful when sharing data with
colleagues or reviewers who wish to view and investigate data you have generated but who do
not have access to a Workbench license.
Figure 1.22: Bioinformatics tools available when using Viewing Mode are found under the Tools
menu.
visible in message windows that appear if a Workbench is started up that has an expired license
or that is configured to use a network license but all the available licenses have been checked
out by others, as described in section 1.4.5.
Click on the Viewing Mode button to start up the Workbench in Viewing Mode.
To go from running in Viewing Mode to running a Workbench with its full functionality, it just
needs to have access to a valid license. This can be done by installing a static license, or when
using a network license, by restarting the Workbench when licenses are once again available.
Figure 1.23: Click on the Viewing Mode button at the bottom of the License Manager window to
launch the Workbench in Viewing Mode.
1.5 Plugins
The functionality of the CLC Main Workbench can be extended by installing plugins. The built-in
Plugin Manager provides an up-to-date listing of the plugins available.
Alternatively, visit our plugin webpage for a list: https://digitalinsights.qiagen.com/
products-overview/plugins/.
Plugins are installed and uninstalled using the Plugin Manager, which can be opened using the
Manage Plugins button in the Toolbar, or by going to the top level menu:
Utilities | Manage Plugins...
Note: To install plugins and modules using a centrally installed CLC Workbench, the software
must be run in administrator mode. On Windows, right-click on the program shortcut and choose
"Run as Administrator". On Linux, this usually means running the software with sudo privileges.
The Plugin Manager has two tabs at the top:
• Download Plugins An overview of plugins available from QIAGEN that are not yet installed
on your Workbench.
1.5.1 Install
To install a plugin, open the Plugin Manager and click on the Download Plugins tab. This will
display an overview of the plugins available (figure 1.24).
Select a plugin in the list to display additional information about it in the right hand pane. Click
on Download and Install to install the plugin.
Accepting the license agreement
The End User License Agreement (EULA) must be read and accepted as part of the installation
process. Please read the EULA text carefully, and if you agree to it, check the box next to the
text I accept these terms. If further information is requested from you, please fill this in before
clicking on the Finish button.
If you have a .cpa plugin installer file on your computer, for example if you have downloaded it
from our website, install the plugin by clicking on the Install from File button at the bottom of the
dialog and specifying the plugin *.cpa file.
When you close the Plugin Manager after making changes, you will be prompted to restart the
software. Plugins will not be fully installed, or removed, until the CLC Workbench has been
restarted.
1.5.2 Uninstall
Plugins are uninstalled using the Plugin Manager (figure 1.25). This can be opened using the
Manage Plugins ( ) button in the Toolbar, or by going to the top level menu:
Utilities | Manage Plugins... ( )
The installed plugins are shown in the Manage Plugins tab of the Plugin Manager. To uninstall,
select the plugin in the list and click Uninstall.
If you do not wish to completely uninstall the plugin, but you do not want it to be used next time
you start the Workbench, click the Disable button.
When you close the dialog, you will be asked whether you wish to restart the Workbench. The
plugin will not be uninstalled until the Workbench is restarted.
In this list, select which plugins you wish to update, and click Install Updates. If you press
Cancel you will be able to install the plugins later by clicking Check for Updates in the Plugin
Manager (see figure 1.25).
List hosts that should be contacted directly, i.e. not via the proxy server, in the Exclude hosts
field. The value can be a list, with each host separated by a | symbol. The wildcard character *
can also be used. For example: *.foo.com|localhost.
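As an illustration only, the following Python sketch shows one way a value in this format could be interpreted. The helper name is_excluded and the use of shell-style wildcard matching are assumptions made for this example; the matching rules applied by the Workbench itself may differ.

```python
from fnmatch import fnmatchcase


def is_excluded(host: str, exclude_hosts: str) -> bool:
    """Return True if host matches any |-separated pattern (hypothetical helper)."""
    # Split the field value on the | separator and ignore empty entries.
    patterns = [p.strip() for p in exclude_hosts.split("|") if p.strip()]
    # Host names are case-insensitive, so compare everything in lower case.
    return any(fnmatchcase(host.lower(), p.lower()) for p in patterns)


exclude = "*.foo.com|localhost"
for host in ("server.foo.com", "localhost", "example.org"):
    print(host, is_excluded(host, exclude))
```

Under these assumptions, a pattern such as *.foo.com covers hosts within foo.com but not the bare name foo.com, which would need its own entry in the list.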
The proxy can be bypassed when connecting to a CLC Server by checking the box next to Bypass
proxy when connecting to a CLC Server.
Workbenches can be preconfigured to bypass the proxy settings when connecting to a CLC Server
by configuring this setting in a proxy.properties file, where the IP address (not the host name)
of the CLC Server is provided in the proxyexclude field. See
https://resources.qiagenbioinformatics.com/manuals/workbenchdeployment/current/index.php?manual=Per_computer_Workbench_information.html
for further details.
If you have problems with your proxy settings, please contact your systems administrator.
Part II
Core Functionalities
Chapter 2
User interface
Contents
2.1 View Area
2.1.1 Close views
2.1.2 Save changes in a view
2.1.3 Undo/Redo
2.1.4 Arrange views in View Area
2.1.5 Moving a view to a different screen
2.1.6 Side Panel
2.2 Zoom functionality in the View Area
2.3 Toolbox panel
2.4 Processes tab and Status bar
2.5 History and Element Info views
2.6 Workspace
2.7 List of shortcuts
The user interface of the CLC Main Workbench when it is first opened looks like that shown in
figure 2.1.
Key areas are listed below with a brief description and links to further information.
• Navigation Area Data elements stored in File Locations are listed in the Navigation Area.
(Section 3.1).
Processes Running and finished processes are listed under this tab. (Section 2.4)
Tools Analysis tools are available under this tab. (Section 2.3)
Workflows Template workflows and installed workflows are available under this tab.
Apart from running workflows from here, you can right-click on them and choose to
open a copy of the workflow. This opens a copy in the Workflow Editor (section 13.1).
Favorites Tools you use most often are listed here, and you can add tools you want,
for quick access. (Section 2.3)
• View Area Data and workflows can be opened in this area for viewing and editing.
(Section 2.1) A Side Panel with configuration options is present on the right hand side when
items are open in this area (Section 4.6).
• Menu bar and Toolbar Many tools and associated actions can be launched using buttons
and options in these areas.
• Status Bar The Workbench status and its connections to other systems are presented in
this area. (Section 2.4).
Figure 2.1: The CLC Workbench interface includes the Navigation Area in the top left, several tabs
in the Toolbox area at the bottom left, a large viewing area on the right, menus and toolbars at the
top, and a status bar at the bottom. A sequence has been opened in the View area, and a Side
Panel containing settings relevant to viewing sequences is thus present on the far right.
Different areas of the interface can be hidden or made visible, as desired. Options controlling this
are available under the View menu at the top. For example, whether the Toolbox panel should be
visible, and which tabs should be visible in the Toolbox area can be configured using the menu
options under:
View | Show/Hide Toolbox
You can also collapse the various areas by clicking on buttons like ( ) or ( ), where they
appear. Similar buttons are presented for revealing areas if they are hidden.
Tabs can be dragged to put them in the order desired, and to move them to a different area in a
split view (section 2.1.4).
Right-clicking on a tab opens a menu with various navigation options, as well as the ability to
select tools and viewing options, etc.
2. Right-click on an element in the Navigation Area, and choose the Show option from the
context menu.
3. Drag elements from the Navigation Area into the Viewing Area.
4. Select an element in the Navigation Area and use the keyboard shortcut Ctrl + O (⌘ + O on
Mac).
5. Choose the option to "Open" results when launching an analysis. (This is only recommended
when small numbers of elements will be generated, and where it is not important to save
the results directly.)
When opening an element while another element is already open, the newly opened element will
become the active tab. Click on any other tab to open it and make it the active view. Alternatively,
use the keyboard shortcuts to navigate between tabs: Ctrl + PageUp or PageDown (or ⌘ +
PageUp or PageDown on Mac).
To provide more space for viewing data, you can hide the Navigation Area and Toolbox by clicking the
hide icon ( ) at the top of the Navigation Area. You can also hide the Side Panel using the
same icon at the top of the Side Panel.
Tooltips
For some data types and some views, tooltips provide additional useful information. Hover the
mouse cursor over an area of interest to reveal these. For example, hover over an annotation on
a sequence and a tooltip containing details about that annotation is shown. Hover over a variant
in a variant track, and information about that variant is shown.
If you wish to hide such tooltips while moving the mouse around in a view, hold down the Ctrl key.
Tooltips can take a moment to appear. To make them show up immediately while moving the
mouse around in a view, hold down the Shift key.
Figure 2.2: Four elements are open in the View Area, organized in 2 areas horizontally - 3 elements
in the top area, and one in the bottom. The active view is in the top area, as indicated by the blue
bar just under the tabs.
For illustration, the icons for views available for sequence elements are shown in figure 2.3.
Clicking on the Show As Circular ( ) icon would present the sequence in a circular view. Mouse
over any of these icons to see the type of view they represent.
Figure 2.3: The icons presented at the bottom of an open nucleotide sequence. Clicking on each
of these presents a different view of the data.
Linked views Different views of the same element, or different elements referred to by another
element, can be opened in a "linked view". This is particularly useful with multiple viewing areas
open, i.e. split views. When views are linked, selecting an item or region in one view brings
the relevant item or region into focus in the linked view(s). See figure 2.4, where a region was
selected in one view, and that selection is then also shown in the other view.
To open a linked view, hold down the Ctrl key (⌘ on Mac) and then click on the item to
open. E.g. to open a different view of the same element, click on one of the icons at the bottom
of the open view. The new view will open, often in a second, horizontal view area. When the View
Area is already split horizontally, the new view is opened in the area not occupied by the original
view.
• Close Tab Group When tabs are open in a split view, all tabs in the same area as the
selected tab will be closed. When not in split view, this option has the same effect as
Close All Tabs.
• Close All Other Tabs Close all tabs in all tab areas except the selected tab.
Figure 2.5: Right-click on the tab for a view, to see the options relating to closing open views.
Figure 2.6: The ATP8a1 mRNA element has been edited, but the changes are not saved yet. This
is indicated by an * on the tab name in the View Area, and by the use of bold, italic font for the
element's name in the Navigation Area.
The Save function may be activated in two ways. Select the tab of the view you want to save, and
then either click Save ( ) or press Ctrl + S (⌘ + S on Mac).
If you close a tab of a view containing an element that was edited, you will be asked if you want
to save.
When saving an element from a new view that has not been opened from the Navigation Area, a
save dialog appears (figure 2.7). In this dialog, you can name the element and select the folder
in which you want to save the element.
Figure 2.7: Save dialog. The new element has been named "New element that needs to be saved"
and will be saved in the "Example Data" folder.
2.1.3 Undo/Redo
If you make a change to an element in a view, e.g. remove an annotation in a sequence or modify
a tree, you can undo the action. In general, Undo applies to all changes you can make when
right-clicking in a view. Undo is done by:
Click Undo ( ) in the Toolbar or press Ctrl + Z
If you want to undo several actions, just repeat the steps above.
To reverse the undo action:
Click the Redo icon in the Toolbar or press Ctrl + Y
Note! Actions in the Navigation Area, e.g., renaming and moving elements, cannot be undone.
However, you can restore deleted elements (see section 3.1.8).
You can set the number of possible undo actions in the Preferences dialog, see section 4.
Figure 2.9: Showing the table on one screen while the sequence is displayed on another screen.
Clicking the table of open reading frames causes the focus to shift to the corresponding region in
the linked view, and that region to be selected.
A red highlight indicates where it will be placed. Alternatively, clicking on the ( ) button at the
top left of the floating palette will place it at the bottom of the Side Panel. All floating palettes
can be re-docked by clicking on the ( ) button at the bottom of the Side Panel.
The whole Side Panel can be hidden or revealed using buttons at the top right: ( ) to hide the
Side Panel and ( ) to reveal it, if it was hidden. The keyboard shortcut Ctrl + U (⌘ + U on Mac)
can also be used for these actions.
A gradient can be chosen from a predefined set of gradients (figure 2.13) or customized by
setting:
Figure 2.10: Side Panel settings for a nucleotide sequence. The Annotation layout palette is
expanded, while the remaining palettes are collapsed. In the bottom left corner of the Side Panel
are buttons for expanding, collapsing and re-docking all palettes.
Continuous: the color gradually changes from one set color and location to the next.
Discrete: only the set colors are used and they change abruptly at the specified
locations.
• Each color in the gradient and its location within the gradient.
Gradient settings can be reused, making it easy to apply the same gradient consistently across
different views. This is done using buttons in the 'Configure gradient' dialog (figure 2.13):
• Click on Copy All to copy the gradient configuration. You can paste this into a text file for
later use.
• Click on the Paste button to apply copied gradient settings. Colors and locations present
in the 'Configure gradient' dialog are overwritten by this action.
Figure 2.11: The Annotation types and Motifs palettes have been undocked. The Nucleotide info
palette has been moved to the top of the Side Panel. The background color of nucleotides reflects
the quality scores.
Figure 2.12: Clicking on the color of the mRNA annotation type opens a dialog where the color can
be changed.
Figure 2.13: Clicking on the gradient of the quality scores opens a dialog where the gradient can
be changed.
Figure 2.14: Zoom tools are located at the bottom right corner of the view.
• Shortcuts for zooming out to fit the width of the view ( ) or zooming in all the way to see
details ( ).
• A shortcut to zoom to a selection ( ). Select a region in the view, and then click this icon
to zoom in on the selected region. (Keyboard shortcut Ctrl + 1)
• A slider to zoom in and zoom out to any desired level. The slider position reflects the
current zoom level. Move the slider left to zoom out, or right to zoom in. For fine grained
control, click on the slider and move the mouse up slightly or down slightly.
• Mouse mode buttons:
Selection mode ( ). Used when you wish to select data in a view. This is the default.
Zoom in mode ( ). When selected, clicking on a location in the view zooms in, with the focus
on that location. Alternatively, drag a box around an area, and the view will be zoomed to that
area. (Keyboard shortcut Ctrl + 2)
If you press and hold on ( ) or right-click on it, two other modes become available
(figure 2.15).
Panning ( ) When selected, you can pan around in the view using the
mouse. (Keyboard shortcut Ctrl + 4)
Zoom out ( ) When selected, whenever you click the view, it zooms out. (Keyboard
shortcut Ctrl + 3)
Additional notes:
• If you hold the mouse over the selection and zoom tools, tooltips will appear that provide
further information about how to use the tools.
• If you press the Shift button on your keyboard while in zoom mode, the zoom function is
reversed.
Figure 2.15: Additional mouse modes can be found in the zoom tools when right-clicking on the
magnifying glass.
• You may have to click in the view before you can use the keyboard or the scroll wheel to
zoom.
In many views, you can zoom in by pressing '+' on your keyboard, or zoom out by pressing '-' on
your keyboard.
If you have a mouse with a scroll wheel, you can also do the following:
Zoom in: Press and hold Ctrl (⌘ on Mac) and move the scroll wheel on your mouse forward.
Zoom out: Press and hold Ctrl (⌘ on Mac) and move the scroll wheel on your mouse backward.
Tools tab
Tools available in the CLC Workbench are provided under the Tools menu, which is available in the
Toolbox panel and as a menu at the top of the Workbench. The Tools menu is also available, in an
extended form, in the Add Elements dialog available in the Workflow Editor (see section 13.1.1).
Tools are organized in folders according to their functionality. Tools provided by plugins may be
in a plugin-specific folder (figure 2.16). When connected to a CLC Genomics Server with external
applications configured and available, a folder for these will also be present in the Tools menu
(figure 2.17).
You can search for tools of interest in the Tools tab in the Toolbox by entering a search term into
the field at the top of the tab.
Workflows tab
Figure 2.16: The Tools tab in the Toolbox contains folders of available tools. This Workbench is not
connected to a CLC Server, as indicated by the grey server icon in the status bar.
Figure 2.17: This Workbench is connected to a CLC Server, as indicated by the blue server icon
in the status bar. External applications have been configured and enabled on that CLC Server, so
an External Applications folder is listed, which contains those external applications. The server icon
within that folder's icon is a reminder that these are only available when logged into the CLC Server.
Workflows installed on the Workbench and template workflows (see section 13.5) are listed in
the Workflows tab in the Toolbox (figure 2.18). The Workflows menu is also available at the top
of the Workbench.
When connected to a CLC Server, workflows installed on that server will be available from a folder
in the Workflows menu called Installed Workflows (Server) (figure 2.19).
You can search for workflows of interest in the Workflows tab in the Toolbox by entering a search
term into the field at the top of the tab.
Figure 2.18: The Workflows tab in the Toolbox contains folders for workflows installed on the
Workbench and template workflows. This Workbench is not connected to a CLC Server, as indicated
by the grey server icon in the status bar. There are also no active AWS Connections, as indicated
by the grey cloud icon in the status bar.
Figure 2.19: This Workbench is connected to a CLC Server, as indicated by the blue server icon in
the status bar. This CLC Server has workflows installed on it, so a folder containing those workflows,
Installed Workflows (Server), is present in the Workflows menu.
Favorites tab
You can specify tools, or folders of tools, that you want to find quickly as favorites. In addition,
the 10 tools you use most frequently are automatically identified as your frequently used tools
or workflows. These lists are also made available in the Quick Launch tool ( ), started using
the Launch button in the toolbar, and in the Add Elements dialog available in the Workflow Editor
(see section 13.1.1).
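The idea of a frequently used list can be sketched in Python as below. How the Workbench actually records usage is not described here, so the launch log, the tool names in it, and the function name frequently_used are hypothetical; only the limit of 10 comes from the description above.

```python
from collections import Counter


def frequently_used(launch_log, limit=10):
    """Return up to `limit` tool names, most frequently launched first."""
    # Counter tallies each name; most_common sorts by descending count.
    return [name for name, _ in Counter(launch_log).most_common(limit)]


# Hypothetical record of tool launches, oldest first.
log = ["Tool A", "Tool B", "Tool A", "Tool C", "Tool A", "Tool B"]
print(frequently_used(log))
```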
Manually adding tools to the Favorites list is done from tabs in the Toolbox panel:
• Right-click on a tool or folder of tools in the Tools tab and choose the option "Add to
Favorites" from the menu that appears (figure 2.21), or
• Open the Favorites tab, right-click in the Favorites folder, choose the option "Add tools" or
"Add group of tools". Then select the tool or tool group to add.
• From the Favorites tab, click on an item in the Frequently used folder and drag it into the
Favorites folder.
Figure 2.20: Under the Favorites tab is a folder containing your frequently used tools, which are
added automatically, based on usage, and a folder containing tools you have specified as favorites.
Items within the Favorites folder can be re-ordered by opening the Favorites tab in the Toolbox
area in the bottom, left hand side of the Workbench and dragging tools up and down within the
list. (Folders cannot be repositioned.)
Figure 2.21: Tools or workflows can be added to the Favorites tab by right-clicking on them in the
Toolbox area and choosing the "Add to Favorites" option.
To remove an item from the Favorites tab, right-click on it and choose the option Remove from
Favorites from the menu that appears.
You can search for items of interest in the Favorites tab in the Toolbox by entering a search term
into the field at the top of the tab.
into a CLC Server, the status of your jobs that are running, completed or queued on the server
is also displayed.
Several options are available after clicking on the small icon ( ) next to a given process, as
shown in figure 2.22.
Figure 2.22: Jobs completed during a Workbench session and the progress of running jobs are visible
in the Processes tab. The progress of a running job is also visible in the bottom frame of the
Workbench. Clicking the small icon next to a process in the Processes tab reveals a menu with actions
that can be taken.
For completed jobs, these options provide a convenient way to locate results in the Navigation
Area:
• Show results Open the results generated by that process in the Viewing Area. (Relevant if
results were saved, as described in section 11.2.)
• Find results Highlight the results in the Navigation Area. (Relevant if results were saved,
as described in section 11.2.)
• Show Log Information Opens a log of the progress of the process. This is the same log
that opens if the Open Log option is selected when launching a task.
• Show Messages Show any messages that were produced during the processing of your
data.
Stopped, paused and finished processes are not automatically removed from the Processes tab
during a Workbench session. They can, however, be removed by right-clicking in the Processes
tab and selecting the option "Remove Finished Processes" or by going to the option in the main
menu system:
Utilities | Remove Finished Processes ( ) .
If you close the Workbench while jobs are still running on it, a dialog will ask for confirmation
before closing. Workbench processes are stopped when the software is closed and these
processes are not automatically restarted when you start the Workbench again. Closing the
Workbench does not interrupt jobs sent to a CLC Server, as described below.
• User The username of the person who performed the operation. If you import data created
by another person, that person's username will be shown.
• Date and time Date and time the operation was carried out. These are displayed according
to your locale settings (see section 4.1).
• Version The software name and version used for that operation.
Figure 2.23: The history of an element created by an installed workflow called assemble-seqs-wf.
• Comments Additional details added here by tools or details that have been added manually.
Click on Edit to add information to this field.
• Originates from A list of the elements that the current element originated from. Clicking on
the name of an originating element selects it in the Navigation Area. Click on the "(show)"
link to open the originating element to its default view. Click on "(history)" to open the
originating element to its History view.
• Column width
• Show column
• Workflow details Present if the element is an output from a workflow. The name and
version of the workflow are listed here, and if the element was generated by an installed
workflow (including template workflows), the workflow build id is also reported¹. If the
element is output by a workflow launched from the Workflow Editor, the version is reported,
but there will be no build id.
If an installer has never been made for a workflow, then data elements created using that
workflow (launched from the Workflow Editor), will have 0.1 reported as the workflow version in
their history. Workflows that have been used to make an installer inherit the most recent version
assigned when creating the workflow installer. See section 13.6.2 for more on creating workflow
installers.
¹ Workflow build ids are included in the history of elements generated using version 24.0 or later.
2.6 Workspace
Workspaces are useful when working on more than one project. Open views and their arrangement
are saved in a given workspace. Switching between workspaces can thus save much time when
working on several different sets of data and results.
Initially, there is a single workspace called "Default". When you set up other workspaces, you
assign each a name, which is used when re-opening that workspace, and which is displayed in
the title bar of the Workbench when it is the active workspace.
The state of each workspace is saved automatically when the Workbench is closed down. The
workspace that was open when closing down is the one that will be opened when the Workbench
is started up again.
Figure 2.24: The workspace called "My Microbial Workspace" is open after selecting it from the
menu opened by clicking on the Manage Workspaces button in the Toolbar. The name of the
workspace is visible in the Workbench title bar.
Workspaces do not affect the underlying organization of data, so folders and elements remain
the same in the Navigation Area.
Workspaces can be created, opened and deleted using the options available under the Manage
Workspaces button in the top Toolbar, as described below. This functionality is also present
under the View menu.
Creating a workspace
Create a new workspace by clicking on the Manage Workspaces button in the top Toolbar.
In the drop-down menu that appears, choose the option Create Workspace.
In the dialog that appears, enter a name for the new workspace.
When you click on the OK button, the new workspace is created and opened. The name of the
workspace will be in the title bar of the Workbench.
Initially, the Navigation Area may be collapsed. Open it up again by clicking on the small black
triangle at the top right of the Navigation Area.
Opening a workspace
Switch between workspaces by clicking on the Manage Workspaces button in the top Toolbar and
selecting the desired workspace from the list presented.
The name of the active workspace will be greyed out in the list.
Deleting a workspace
To delete a workspace, click on the Manage Workspaces button in the top Toolbar and select
the option Delete Workspace.
Workspaces that can be deleted are listed in a drop-down menu in the dialog that appears. Select
the one to delete.
Deletion of workspaces cannot be undone.
Note: The Default workspace is not offered, as it cannot be deleted.
Contents
3.1 Navigation Area
3.1.1 Data structure
3.1.2 Adding and removing locations
3.1.3 Data sharing information
3.1.4 Create new folders
3.1.5 Multiselecting elements
3.1.6 Copying and moving elements and folders
3.1.7 Updating element and folder names
3.1.8 Delete, restore and remove elements
3.1.9 Show folder elements in a table
3.2 Working with non-CLC format files
3.3 Customized attributes on data locations
3.3.1 Setting custom attribute values
3.3.2 Custom attributes on elements copied to other data locations
3.3.3 Searching custom attributes
3.4 Searching for data in CLC Locations
3.4.1 Quick Search
3.4.2 Local Search
3.5 Backing up data from the CLC Workbench
3.6 Working with AWS S3 using the Remote Files tab
This chapter explains general data management features of CLC Main Workbench. The first
section explains the basics of the data organization and the Navigation Area. The next section
explains how to set up custom attributes for the data that can be used for more advanced data
management. Finally, there is a section about how to search for data in your CLC Locations. The
use of metadata tables in CLC Main Workbench is described separately, in chapter 12.
We recommend that data is only added to and removed from CLC Data Locations using CLC
software. If files are moved using other methods, the data may not be found when launching
an analysis, and searches may not find that data. To address issues where data present in a
CLC Data Location cannot be found, re-build the index for that location and try again. Information
about rebuilding indexes can be found in section 3.4.
CHAPTER 3. DATA MANAGEMENT AND SEARCH 66
Each CLC data element has a name and an icon that represents the type of data in the
element. A list of many of the icons and the type of data they represent can be found at
https://qiagen.my.salesforce-sites.com/KnowledgeBase/KnowledgeNavigatorPage?id=kA41i000000L5uFCAS.
Non-CLC files placed into CLC locations will have generic icons beside them, and any suffix in
the original file name will be visible in the Navigation Area. (e.g. .pdf, .xml and so on.)
Elements placed in a folder (e.g. by copy/pasting or dragging) are put at the bottom of the folder
listing. If the element is placed on another element, it is put just below that element in the folder
listing. If an element of the same name already exists in that folder, a copy is created with the
name extension "-1", "-2" etc. See section 3.1.6 for further details.
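The naming rule for copies can be sketched in Python as follows. This is an illustration under assumptions: the function name unique_name is hypothetical, and the choice of the lowest unused number is a guess at how the suffix is picked.

```python
def unique_name(name, existing):
    """Return name, or name with a "-1", "-2", ... extension if it is taken."""
    if name not in existing:
        return name
    n = 1
    # Assumption: the lowest unused number is chosen for the extension.
    while f"{name}-{n}" in existing:
        n += 1
    return f"{name}-{n}"


folder = {"ATP8a1 mRNA", "ATP8a1 mRNA-1"}
print(unique_name("ATP8a1 mRNA", folder))
```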
Elements in a folder can be sorted alphabetically by right-clicking on the folder and choosing the
option Sort Folder from the menu that appears. When sorting this way on Windows, subfolders
are placed at the top of the folder with elements listed below in alphabetical order. On Mac,
subfolders and elements are listed together, in alphabetical order.
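The two orderings can be compared with a short Python sketch. The function name sort_folder, the representation of entries as (name, is_folder) pairs, and the case-insensitive comparison are assumptions made for this example.

```python
def sort_folder(entries, platform):
    """Sort (name, is_folder) pairs the way Sort Folder is described per platform."""
    if platform == "windows":
        # Subfolders first, then elements, each group in alphabetical order.
        return sorted(entries, key=lambda e: (not e[1], e[0].lower()))
    # Mac: subfolders and elements listed together in alphabetical order.
    return sorted(entries, key=lambda e: e[0].lower())


entries = [("reads", False), ("Results", True), ("alignment", False)]
print(sort_folder(entries, "windows"))
print(sort_folder(entries, "mac"))
```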
Opening and viewing CLC data elements is described in section 2.1.
Just above the data listing area is a Quick Search field, which can be used to find elements in
your CLC Locations. See section 3.4.1.
Just above the Quick Search field are icons that can be clicked upon. On the left side, from left
to right:
• Collapse all ( ). Close all the open folders in the Navigation Area.
• Add File Location ( ). Add a new top level location for storing CLC data. See section 3.1.2
for further details.
• Decrease font size ( ) and increase font size ( ) Decrease or increase the font size
in the Navigation Area, both in the left hand side of the Workbench and other locations,
such as launch wizard steps where data elements can be selected. The font size in the
Tools, Workflows and Favorites tabs in the Toolbox, just below the Navigation Area, is
also adjusted.
• Restrict data types listed ( ) Click on this icon to reveal a list of data types. Selecting
one of these types will result in only elements of that type, and folders, being shown in the
Navigation Area. Click on it again and select "All Elements" to see all elements listed
once more.
• Hide the Navigation Area and Toolbox ( ). This icon is at the top, right hand side.
Clicking on it hides the Navigation Area and the Toolbox panels, providing more area for
viewing data. To reveal the panels again, click on the ( ) icon that is shown when the
panels are hidden.
Figure 3.2: Data locations on a CLC Server are highlighted with blue square icons in the Navigation
Area.
Figure 3.3: Mousing over the 'CLC_Data' location reveals a tooltip showing the full path to the
folder on the underlying file system.
When the CLC Main Workbench is started for the first time, there will be a location called
CLC_Data, which is the default data location.
Adding more locations and removing locations is described in section 3.1.2. Another location
can be specified as the default by right-clicking on the location folder in the Navigation Area and
choosing the option Set as Default Location from under Location in the menu (figure 3.4). This
setting only applies to you. Other people using the same Workbench can set their own default
locations.
Administrators can also change the default data location for all users of a Workbench. This
is described at
https://resources.qiagenbioinformatics.com/manuals/workbenchdeployment/current/index.php?manual=Default_Workbench_data_storage.html.
Note: There will also be a location called CLC_References. This location is of relevance primarily
if you are working with others using a CLC Genomics Workbench, who are sharing their results
and reference data with you. It is intended for storing genomic references and associated data,
downloaded using the Reference Data Manager, distributed with the CLC Genomics Workbench.
• Windows: C:\Users\<your_username>\CLC_Data
• Mac: /CLC_Data
• Linux: /homefolder/CLC_Data
Figure 3.4: Data location options are available in a right-click context menu. Here, a new data
location is being specified as the default.
Adding locations
To add a new location, click on the ( ) icon at the top of the Navigation Area, or go to:
File | Location | New File Location ( )
Navigate to the folder to add as a CLC data location (see figure 3.5).
The name of the new location will be the name of the folder selected. To see the full path to the
folder on the file system, hover the mouse cursor over the location icon ( ).
The new location will appear in the Navigation Area (figure 3.6).
• You must have permission to read from that folder, and if you plan to save new data
elements or update data elements, you must have permission to write to that folder.
• The folder chosen as a CLC location must not be a subfolder of any area already being used
as a CLC Workbench or CLC Server location.
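The subfolder restriction can be checked before adding a location. The following is a minimal sketch, assuming you already know the paths of the folders in use as CLC locations; the function name is hypothetical and this is not how the Workbench itself performs the check.

```python
from pathlib import Path
from typing import Iterable


def is_inside_existing_location(candidate: str, locations: Iterable[str]) -> bool:
    """Return True if `candidate` lies below any folder already used as a location."""
    cand = Path(candidate).resolve()
    for location in locations:
        # Path.parents lists every ancestor folder of the candidate.
        if Path(location).resolve() in cand.parents:
            return True
    return False
```

A candidate for which this returns True should not be added as a new CLC location.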
Figure 3.6: A new CLC location has been added. When the selected folder has not been used as a
CLC location before, index files will be built, with the index building process listed in the Processes
tab below the Navigation Area.
Folders on a network drive or a removable drive can act as CLC locations. Please note though
that interruptions to file access can lead to problems. For example, if you have set up a CLC
location on OneDrive, start editing a cloning experiment, and your laptop goes to sleep, unsaved
work may be lost, and errors relating to the lost connection may be reported. If your CLC locations
are on such systems, enabling offline access (aka "always available files") can avoid such
issues.
Locations appear inactive in the Navigation Area if the relevant drive is not available when you
start up the Workbench. Once the drive is available, click on the Update All symbol ( ) at the
top of the Navigation Area to refresh the view. All available locations will then be shown as active.
There can sometimes be a short delay before the interface update completes.
See section 3.1.3 for information relating to sharing CLC data locations.
Removing locations
To remove a CLC data location, right-click on the location (the top level folder), and select
Location | Remove Location. The Location menu is also available under the top level File menu.
CLC data locations that have been removed can simply be re-added if you wish to access the
data via the Workbench Navigation Area again.
After removing the CLC location, standard operating system functionality can be used to remove
the folder and its contents from the local file system, if desired.
• We do not support concurrent alteration of data. While the software will often detect this
situation and handle it appropriately, by for example only allowing read access to all but
the one party editing the file, we do not guarantee this.
• Any functionality that involves using the data search indices (e.g. search functionality,
associating metadata with data) will not work properly for shared data locations. Re-indexing
a Data Location can help in the short term, but as soon as a new file is created
by another piece of software, the index will be out of date.
If you decide to share data via Workbenches this way, it is vital that every Workbench adds
exactly the same folder in the file system hierarchy as its CLC location. That is, when adding a
location already used by other Workbenches, select the same folder that those Workbenches selected.
Indicating a folder higher up or lower down in the hierarchy will cause problems with the indexing
of the files. This can lead to newly created objects made by Workbench A not being found when
searching from Workbench B and vice versa, as well as issues with associations to CLC Metadata
Tables.
• Holding down the <Ctrl> key (⌘ on Mac) while clicking on multiple elements selects the
elements that have been clicked.
• Selecting one element, and selecting another element while holding down the <Shift> key
selects all the elements listed between the two locations (the two end locations included).
• Selecting one element, and moving the cursor with the arrow-keys while holding down the
<Shift> key, enables you to increase the number of elements selected.
• As keyboard shortcuts:
When you cut an element, it will appear "grayed out" until you activate the paste function.
You can revert the cut command by copying another element.
Copies of an element open in the View area can also be made by clicking on its tab in the View
Area and dragging the tab to the desired location in the Navigation Area. This is not a way to save
updates to an existing element. Any unsaved changes to the original element (the one open in
the View area) remain unsaved until an explicit save action is taken on the original.
• Slow double-click on the item's name, i.e. click on the name once, leave a short pause
and click on the name again.
The speed of a slow double-click is usually defined at the system level. Double-clicking
quickly on an element's name will open it in the viewing area, and double-clicking quickly
on a folder name will open a closed folder or close an open folder.
• Click on the item's name to select it and then press the F2 function key.
• Click on the item's name to select it, and then select the option Rename from the top-level
Edit menu.
When you have finished editing the name, press the Enter key or select another element in the
Navigation Area. To disregard changes before saving them, press the Esc key.
If you update the name of an item you do not have permission to change, the new name will not
be kept. The original name will be retained.
Renaming annotations is described in section 14.3.3.
1. Move it to the recycle bin by using the Delete ( ) option from the Edit menu, the right-click
menu of an element, or in the Toolbar, or use the Delete key on your keyboard.
2. Empty the recycle bin using the Empty Recycle Bin command available under the Edit
menu or in the menu presented when you right-click on a Recycle Bin ( ).
Note! Emptying the recycle bin cannot be undone. Data is not recoverable after it has been
removed by emptying the recycle bin.
For deleting annotations from sequences, see section 14.3.5.
To restore items in a recycle bin:
• Drag the items using your mouse into the folder where they used to be, or
• Right-click on the element and choose the option Restore from Recycle Bin.
• The contents of your server-based recycle bin can be accessed by you and by your server
administrator.
• CLC Server settings can affect how you work with server-based recycle bins. For example:
Batch edit folder elements You can select a number of elements in the table, right-click and
choose Edit to batch edit the elements. In this way, you can change for example the description
or name of several elements in one go.
In figure 3.8 you can see an example where two sequences are renamed in one go.
In this example, a dialog with a text field will be shown, letting you enter a new name for these
two sequences.
Drag and drop folder elements You can drag and drop objects from the folder editor to the
Navigation Area. This will create a copy of the objects at the selected destination. New elements
can be included in the folder editor in the View Area by dragging and dropping an element from
elsewhere in the Navigation Area to the folder in the Navigation Area that you have open in
the View Area. It is not possible to drag elements directly from the Navigation Area to the folder
editor in the View Area.
2. Right-click on a selected file and choose the option Save to disk... from the menu
(figure 3.9).
Figure 3.9: Select one or more non-CLC format files, right-click and choose the option "Save to
disk...".
Using drag and drop for copying or moving non-CLC format files
Non-CLC format files can be saved to another place accessible on your system using drag and
drop. To do this:
2. Keeping the mouse button depressed, drag the selection to a local file browser.
Note: Dragging to a place on the same file system as the CLC Location results in the file(s)
being moved from the CLC Location to the new location. Dragging to a location on a different
file system results in the file(s) being copied, thus leaving the original file(s) in place in the CLC
Location.
The "Save to disk..." functionality described in the section above always makes a copy.
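To anticipate whether a drag is likely to move or copy a file, one option is to compare the device identifiers that the operating system reports for the source file and the target folder. The sketch below is an approximation under that assumption; the function name drag_would_move is hypothetical, and how the Workbench itself decides whether two places are on the same file system is not described in this manual.

```python
import os


def drag_would_move(source_file: str, target_dir: str) -> bool:
    """Return True if source and target appear to be on the same file system.

    Assumption: equal st_dev values mean "same file system", in which case the
    note above says the file would be moved rather than copied.
    """
    return os.stat(source_file).st_dev == os.stat(target_dir).st_dev
```

If the original must stay in the CLC Location regardless, use "Save to disk...", which always makes a copy.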
Figure 3.10: The Attribute Manager is used to add custom attributes to a CLC Data Location or
CLC Server File System Location.
Click on the Add Attribute ( ) button to create a new attribute. The following attribute types are
available in the Create Attribute dialog (figure 3.11):
• List The value for this attribute type will be specified by selecting from a drop-down list.
The values in that list are defined by clicking on the Add Value ( ) button in the right-hand
panel (figure 3.12).
• Bounded number Same as number, but minimum and maximum accepted values can be
specified.
• Bounded decimal number Same as decimal number, but minimum and maximum accepted
values can be specified.
Figure 3.11: The attribute type is selected from a drop-down list, and a name is then assigned to
the attribute. Values are inheritable by default.
When a data element is copied, attribute values are transferred to the copy of the element by
default. This can be changed by unchecking the Values are inheritable checkbox.
When you click on the Create button in the Create Attributes dialog, the attribute will appear
in the list on the left side of the Attribute Manager. When an attribute in that list is selected,
information about it is shown in the right hand panel. For List attributes, values to include in the
drop-down list can also be added or removed (figure 3.12).
Figure 3.12: Information about a custom attribute can be seen on the right-hand side of the
Attribute Manager. For List attributes, values can be added or removed.
Note: Renaming an attribute or changing its type after creation is not possible.
Changing the order of custom attributes Attributes are presented in the Element Info view of a
data element (figure 3.13), and in the table view of a folder, in the same relative order as they
are listed in the Attribute Manager. For the table view of a folder, see
https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Show_folder_elements_in_table.html.
Change the order of the attributes by selecting an attribute and clicking on the up arrow or down
arrow in the Attribute Manager.
• That attribute is no longer present for data elements created after the attribute removal or
for data elements created earlier, but where no value had yet been set for that attribute.
• The attribute and its value will remain present for data elements where a value had
previously been set, but that value will no longer be editable.
• It can be restored by creating a new attribute for that CLC location using the identical
name and type of the removed attribute. Data elements that had values set for the original
attribute will still have the attribute and value, with the value once again being editable.
The text "Not set" appears in red next to attributes with no value set in a data element's Element
Info view (section 2.5) (figure 3.14). No value is assigned by default.
Note: After editing attribute values in the Element Info view, be sure to save the changes to the
element (section 2.1.2).
To reset an attribute value in the Element Info view, click on "Clear" beside the attribute name.
This returns it to the "Not set" state.
The updated, saved attribute information can be searched for, as described in Searching custom
attributes (section 3.3.3).
Figure 3.13: Custom attribute values are added and edited in the Element Info view of a data
element.
Figure 3.14: "Not set" is displayed in red text in the Element Info view for the Hyper_link attribute
because no value has been set for that attribute.
Any custom attributes that did not have a value set at the time the data was copied are removed
in the copy. If copied back to the original CLC Data Location, all location-specific attributes are
again available, and those without values are, as usual, not set by default.
A custom attribute of a given name in a given CLC Data Location is different to a custom attribute
of the same name in a different CLC Data Location. So, for example, a custom attribute called
"Shelf number" can exist, separately, in two CLC Data Locations. If a data element is copied from
one of those locations to the other, it could have two custom attributes called "Shelf number",
where only the value for the current location would be editable. Local searches consider only the
names of attributes, so searching would find the data element based on the value for either of
these "Shelf number" attributes.
See Searching custom attributes (section 3.3.3) for further details.
For further details about searching, see Local Search and Quick Search.
• Quick Search Available above the Navigation Area and described in section 3.4.1. By
default, terms entered are used to search for the names of data elements and folders
across all available CLC Locations.
• Local Search Available under the Utilities menu and described in section 3.4.2. All CLC
Locations can be searched, or an individual Location can be searched. Local searches can
be saved, so that the same search can be run again easily.
• Broccoli sequence
• Coliform set
Search with a single term to look for any element or folder with a name containing that term.
Example 1: A search for coli would return all 3 elements listed above.
Search with two or more terms to look for any element or folder with a name containing all of
those terms.
Example 2: A search for coli set would return "Coliform set" but not the other two entries listed
in the earlier example.
Search with two or more words in quotes to look for any element or folder name containing
those words, appearing consecutively, in the order provided. Whole words must be used within
quotes, rather than partial terms.
For searching purposes, words are the terms on either side of a space, hyphen or underscore in
a name. The names of elements and folders are split into words when indexing.
Example 3: A search for "coli reference" would find an element called "E. coli reference sequence".
Example 4: A search for "coli sequence" would not return any of the elements in the example
list. In the name "E. coli reference sequence", the words coli and sequence are not placed
consecutively, and in "Broccoli sequence", "coli" is a partial term rather than a whole word.
Why only words when searching with quotes? The use of quotes allows quite specific searches
to be run quickly, but only using words, as defined by the indexing system.
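The basic search rules described above can be modeled in a few lines. This is a sketch for reasoning about which names a query would match, not the Workbench's search implementation: the function names are hypothetical, and case-insensitive matching is inferred from Example 1, where coli matches "Coliform set".

```python
import re
from typing import List

# Names used in Examples 1 to 4.
NAMES = ["E. coli reference sequence", "Broccoli sequence", "Coliform set"]


def words(name: str) -> List[str]:
    """Split a name into lower-cased words at spaces, hyphens and underscores."""
    return [w for w in re.split(r"[ \-_]+", name.lower()) if w]


def matches(query: str, name: str) -> bool:
    """True if every bare term is contained in the name and every quoted
    phrase appears as consecutive whole words, in order."""
    name_words = words(name)
    for phrase, term in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            wanted = words(phrase)
            n = len(wanted)
            if not any(name_words[i:i + n] == wanted
                       for i in range(len(name_words) - n + 1)):
                return False
        elif term.lower() not in name.lower():
            return False
    return True


def search(query: str) -> List[str]:
    return [name for name in NAMES if matches(query, name)]
```

With this model, search('coli set') returns only "Coliform set", and search('"coli sequence"') returns nothing, in line with Examples 2 and 4. The ordering of hits described in the tip on whole words is not modeled.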
Tip: Searches with whole words are faster than searching with partial terms. If a term is a word
in some names but a partial term in others, the hits found using the complete word are returned
first. E.g. searches with the term cancer would return elements with names like "cancer reads"
and "my cancer sample" before an element with a name like "cancerreads".
Note: Wildcards (* ? ~) are ignored in basic searches. If you wish to define a search using
wildcards, use the advanced search functionality of Quick Search.
Example 6: A search for path:tutorials cancer would find all data elements or folders where at
least one folder in the path contains "tutorials" (with capital or small letters) and the name of
the folder or element itself includes the term "cancer". In this example, an element named
"cancer tissue reads" in a subfolder under "CLC_Tutorials" would be found, and so would an
element called "cancerreads" in that subfolder.
Figure 3.15: Enter terms in the Quick Search field to look for elements and folders.
• Wildcard multiple character search (*). Appending an asterisk * to the search term finds
matches that start with that term. E.g. a search for BRCA* will find terms like BRCA1,
BRCA2, and BRCA1gene.
• Wildcard single character search (?). The ? character represents exactly one character.
For example, searching for BRCA? would find BRCA1 and BRCA2, but would not find
BRCA1gene.
• Search related words (~). Appending a tilde to the search term looks for fuzzy matches,
that is, terms that almost match the search term, but are not necessarily exact matches.
For example, ADRAA~ will find terms similar to ADRA1A.
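The behavior of the three wildcards can be approximated with Python's standard library. This is a rough analogy only: fnmatch and difflib stand in for whatever matching the Workbench uses internally, the function names are hypothetical, and the 0.8 similarity cutoff is an arbitrary assumption chosen for illustration.

```python
import difflib
import fnmatch


def wildcard_match(pattern: str, term: str) -> bool:
    """Match * (any run of characters) and ? (exactly one character).

    Note: fnmatch also gives square brackets a special meaning, which the
    manual does not describe for Quick Search.
    """
    return fnmatch.fnmatchcase(term.lower(), pattern.lower())


def fuzzy_match(query: str, term: str, cutoff: float = 0.8) -> bool:
    """Approximate the ~ operator with a similarity ratio (assumed cutoff)."""
    ratio = difflib.SequenceMatcher(None, query.lower(), term.lower()).ratio()
    return ratio >= cutoff
```

Here wildcard_match("BRCA?", "BRCA1") is True while wildcard_match("BRCA?", "BRCA1gene") is False, mirroring the examples above, and fuzzy_match("ADRAA", "ADRA1A") is True.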
Search results
When there are many hits, only the first 50 are shown initially. To see the next 50, click on the
Next ( ) arrow just under the list of results.
The number to return initially can be configured in the Workbench Preferences, as described in
section 4.
• Right-click on a search result and click on Show Location in the menu presented.
Figure 3.16: Recent searches are listed and can be selected to be re-run by clicking on the icon to
the left of the search field.
2. Paste the contents of the clipboard (i.e. the copied information) to a place that expects
text. The text that will be pasted is the CLC URL for that element or folder.
Examples of where text is expected include a text editor, email, messaging system, etc. It
also includes the Quick Search field.
3. Paste the CLC URL into the Quick Search field above the Navigation Area to locate the
element or folder it refers to.
If you move the element or folder within the same CLC Location, the CLC URL will continue to
work.
• Broccoli sequence
• Coliform set
Search with a single term to look for any element or folder with a name containing that term.
Example 1: A search for coli would return all 3 elements listed above.
Search with two or more terms to look for any element or folder with a name containing all of
those terms.
Example 2: A search for coli set would return "Coliform set" but not the other two entries listed
in the earlier example.
Search with two or more words in quotes to look for any element or folder name containing
those words, appearing consecutively, in the order provided. Whole words must be used within
quotes, rather than partial terms.
For searching purposes, words are the terms on either side of a space, hyphen or underscore in
a name. The names of elements and folders are split into words when indexing.
Example 3: A search for "coli reference" would find an element called "E. coli reference sequence".
Example 4: A search for "coli sequence" would not return any of the elements in the example
list. In the name "E. coli reference sequence", the words coli and sequence are not placed
consecutively, and in "Broccoli sequence", "coli" is a partial term rather than a whole word.
Why only words when searching with quotes? The use of quotes allows quite specific searches
to be run quickly, but only using words, as defined by the indexing system.
Tip: Searches with whole words are faster than searching with partial terms. If a term is a word
in some names but a partial term in others, the hits found using the complete word are returned
first. E.g. searches with the term cancer would return elements with names like "cancer reads"
and "my cancer sample" before an element with a name like "cancerreads".
Note: Wildcards (* ? ~) are ignored in basic searches. If you wish to define a search using
wildcards, use the advanced search functionality of Quick Search.
• Click on the tab of the search view and drag and drop it into a folder in the Navigation Area.
These actions save the search query. (The search results are not saved.)
This can be useful when you run the same searches periodically.
• Remove the original CLC File Location, and then add the folder from backup as a new CLC
File Location,
or
• Remove the file called ".clcinfo" from the top level of the folder from backup, and then add
the folder as a CLC File Location.
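The second option can be scripted if preferred. The sketch below assumes only what is stated above, that the file is named ".clcinfo" and sits at the top level of the restored folder; the function name is hypothetical. Run it on the folder restored from backup, not on a location currently in use.

```python
from pathlib import Path


def remove_clcinfo(backup_folder: str) -> bool:
    """Delete the top-level .clcinfo file, if present.

    Returns True if a file was removed, False if none was found.
    """
    marker = Path(backup_folder) / ".clcinfo"
    if marker.is_file():
        marker.unlink()
        return True
    return False
```

After the file has been removed, add the folder as a CLC File Location in the usual way.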
CLC File Location information is stored in an XML file called model_settings_300.xml located
in the settings folder in the user home area. Further details about this file and how it
pertains to data locations in the Workbench can be found in the Workbench Deployment Manual:
https://resources.qiagenbioinformatics.com/manuals/workbenchdeployment/current/index.php?manual=Default_Workbench_data_storage.html.
Option 2: Export a folder of data or individual data elements to a CLC zip file
This option is for backing up smaller amounts of data, for example, backing up certain results,
or backing up a whole CLC File Location that contains a small amount of data.
To export data, click on the Export ( ) button in the top toolbar, or go to:
File | Export ( )
Choose zip as the format to export to.
The data to export to the zip file can then be selected.
Further details about exporting data this way are provided in section 8.1.4.
To import the zip file back into a CLC Workbench, click on the Import ( ) button in the top
toolbar and select Standard Import, or go to:
File | Import ( ) | Standard Import
and select Automatic import in the Options area.
Figure 3.18: AWS S3 buckets you have access to are available under the Remote Files tab.
Figure 3.19: This Workbench has a valid AWS Connection and is connected to a CLC Server with a
valid AWS Connection. At least one public S3 bucket has been configured in the Workbench and in
the CLC Server. The S3 buckets available from the selected source are listed in the Remote Files
tab.
To upload data from your Navigation Area to AWS S3, right-click on a folder in the Remote Files
tab and choose the option Upload to this folder (figure 3.20).
Figure 3.20: To upload data from your Navigation Area to AWS S3, open the Remote Files tab,
right-click on the folder you wish to upload data to, and select the option "Upload to This Folder".
Upload is sequential. Information about the data upload is shown in the Processes tab, at the
bottom left of the Workbench (figure 3.21).
Figure 3.21: After choosing to upload data to S3, the progress of the upload is reported in the
Processes tab.
Figure 3.22: Select folders and/or files in the Remote Files tab and right-click to reveal options for
downloading that data.
opened from that list, or individual elements can be selected and downloaded. The Execution
Log is also available from this list (see figure 3.23).
If the Navigation Tools plugin is installed, bookmarks for items in the Remote Files tab can be
made. Double-clicking on bookmarks for individual results files or folders opens the bookmarked
items, as standard. Double-clicking on a bookmark for a workflow-result.json file reveals the same
list of options as double-clicking on the workflow-result.json file in the Remote Files tab directly.
Further details about bookmarks are provided in the Navigation Tools manual at
https://resources.qiagenbioinformatics.com/manuals/navigationtools/current/index.php?manual=Introduction.html.
Note: AWS charges for downloading data from S3. By default, when the download size exceeds
1 GB, you are prompted for confirmation that you wish to proceed. The size required to trigger
this warning can be changed in the General section of the Workbench Preferences (figure 3.24).
Figure 3.23: Double-click on a workflow-result.json file in the Remote Files tab in the Workbench to
reveal a list of all results from a job run in the cloud, as well as the Execution Log. All items can be
downloaded and opened from this menu, or individual items can be selected and downloaded.
Figure 3.24: The download size above which a cost warning dialog is shown can be adjusted in the
Workbench Preferences. The default value is 1000 MB.
Chapter 4
Contents
4.1 General preferences
4.2 View preferences
4.3 Data preferences
4.4 Advanced preferences
4.5 Export/import of preferences
4.6 Side Panel view settings
The Preferences dialog (figure 4.1) offers opportunities for changing the default settings for
different features of the program. Preference settings are grouped under four tabs, each of which
is described in the sections to follow.
The Preferences dialog is opened in one of the following ways:
Edit | Preferences ( )
or Ctrl + K (⌘ + ; on Mac)
Figure 4.1: Preference settings are grouped under the General, View, Data, and Advanced tabs.
CHAPTER 4. USER PREFERENCES AND SETTINGS 93
• Undo Limit The default number of undo actions is 500. Undoing and redoing actions is
described in section 2.1.3.
• Audit Support If this option is checked, all manual editing of sequences will be marked
with an annotation on the sequence (figure 4.2). Placing the mouse on the annotation will
reveal additional details about the change made to the sequence (see figure 4.3). Note
that no matter whether Audit Support is checked or not, all changes are recorded in the
History log ( ) (see section 2.5).
• Number of hits The number of hits shown in CLC Main Workbench, when e.g. searching
NCBI. (The sequences shown in the program are not downloaded, until they are opened or
dragged/saved into the Navigation Area).
• Locale Setting Specify which country you are located in. This determines how punctuation
is used in numbers.
• Show Dialogs Many information dialogs have a checkbox with the option: "Never show this
dialog again". If you have checked such a box, but later decide you wish to see these
notifications, click on the Show Dialogs button.
• Usage information When this item is checked, anonymous information is shared with
QIAGEN about how the Workbench is used. This option is enabled by default.
The information shared with QIAGEN is:
  - Launch information (operating system, product, version, and memory available)
  - The names of the tools and workflows launched (but not the parameters or the data
    used)
  - Errors (but without any information that could lead to loss of privacy: file names and
    organisms will not be logged)
  - Installation and removal of plugins and modules
The following information is also sent:
  - An installation ID. This allows us to group events coming from the same installation.
    It is not possible to connect this ID to personal or license information.
  - A geographic location. This is predicted based on the IP-address. We do not store
    IP-addresses after location information has been extracted.
  - A time stamp
Figure 4.4: Settings relating to views and formatting are found under the View tab in Preferences.
1. Toolbar Specify Toolbar icon size, and whether to display names below those icons.
2. Show Side Panel Choose whether to display the Side Panel by default when opening a new
view.
For any open view, the Side Panel can be collapsed by clicking on the small triangle at the
top left side of the settings area or by using the key combination Ctrl + U (⌘ + U on Mac).
4. Sequence Label allows you to change the source of the name to use when listing sequence
elements in the Navigation Area.
5. User Defined View Settings Data types for which custom view settings have been defined
are listed here. The default settings to apply to a given data type can be specified.
Custom view settings can be exported to a file and imported from a file using the Export...
and Import... buttons, respectively.
To export, select items in the "Available Editors" list and then click on the Export button. A
.vsf file will be saved to the location you specify. You will have the opportunity to deselect
any custom view settings you do not wish to export.
Figure 4.5: Data types for which custom view settings have been defined are listed in the View tab.
Settings for multiple views can be exported by selecting them in the list and clicking on the Export...
button. Any custom views that should not be included can be deselected before exporting.
To import view settings, select a .vsf file and click on the Import... button. Specify
whether the new settings should be merged with the existing settings or whether they
should overwrite the existing settings (figure 4.6). Note: If you choose to overwrite existing
settings, all existing custom view settings are deleted.
Figure 4.6: When importing view settings, specify whether to merge the new settings with the
existing ones or whether to overwrite existing custom settings.
Note: The Export and Import buttons directly under the list of view settings are for exporting
and importing just view settings. The buttons at the bottom of the Preferences dialog are
for exporting all preferences (see section 4.5).
Specifying default view settings for a given data type can also be done using the Manage
View Settings dialog, described in section 4.6. Export and import can also be done there.
6. Molecule Project 3D Editor gives you the option to turn off the modern OpenGL rendering
for Molecule Projects (see section 15.2).
• Multisite Gateway Cloning primer additions, a list of predefined primer additions for Gateway
cloning (see section 23.5.1).
List hosts that should be contacted directly, i.e. not via the proxy server, in the Exclude hosts
field. The value can be a list, with each host separated by a | symbol. The wildcard character *
can also be used. For example: *.foo.com|localhost.
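To check how a given Exclude hosts value would be read, the format can be modeled as a |-separated list of patterns in which * matches any run of characters. The sketch below illustrates that reading only; the function name is hypothetical, and case-insensitive comparison and the use of fnmatch are assumptions rather than a description of the Workbench's own matching.

```python
import fnmatch


def bypasses_proxy(host: str, exclude_hosts: str) -> bool:
    """Return True if `host` matches any pattern in a |-separated exclude list."""
    patterns = [p.strip() for p in exclude_hosts.split("|") if p.strip()]
    return any(fnmatch.fnmatchcase(host.lower(), p.lower()) for p in patterns)
```

With the example value *.foo.com|localhost, both www.foo.com and localhost would be contacted directly under this model, while example.org would go via the proxy.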
The proxy can be bypassed when connecting to a CLC Server by checking the box next to Bypass
proxy when connecting to a CLC Server.
Workbenches can be preconfigured to bypass the proxy settings when connecting to a CLC Server
by configuring this setting in a proxy.properties file, where the IP address (not the host name)
of the CLC Server is provided in the proxyexclude field. See
https://resources.qiagenbioinformatics.com/manuals/workbenchdeployment/current/index.php?manual=Per_computer_Workbench_information.html
for further details.
If you have problems with your proxy settings, please contact your systems administrator.
Default data location The default location is used when you import a file without selecting a
folder or element in the Navigation Area first. It is set to the folder called CLC_Data in the
Navigation Area, but can be changed to another data location using a drop down list of data
locations already added (see section 3.1.2). Note that the default location cannot be removed,
but only changed to another location.
Data Compression CLC format data is stored in an internally compressed format. The application
of internal compression can be disabled by unchecking the option "Save CLC data elements in a
compressed format". This option is enabled by default. Turning this option off means that data
created may be larger than it otherwise would be.
Enabling data compression may impose a performance penalty depending on the characteristics
of the hardware used. However, this penalty is typically small, and we generally recommend that
this option remains enabled. Turning this option off is likely to be of interest only at sites running
a mix of older and newer CLC software, where the same data is accessed by different versions
of the software.
Compatibility information:
• A new compression method was introduced with version 22.0 of the CLC Genomics
Workbench, CLC Main Workbench and CLC Genomics Server. Compressed data created
using those versions can be read by version 21.0.5 and above, but not earlier versions.
• Internal compression of CLC data was introduced in CLC Genomics Workbench 12.0, CLC
Main Workbench 8.1 and CLC Genomics Server 11.0. Compressed data created using
these versions is not compatible with older versions of the software. Data created using
these versions can be opened by later versions of the software, including versions 22.0
and above.
To share specific data sets for use with software versions that do not support the compression
applied by default, we recommend exporting the data to CLC or zip format and turning on the
export option "Maximize compatibility with older CLC products". See section 8.1.4.
NCBI Integration Without an API key, access to NCBI from a single IP address is limited to 3
requests per second. If many workbenches share the same IP address when running the Search
for Sequences at NCBI and Search for PDB Structures at NCBI tools, they may hit this limit. In
this case, you can create an API key for NCBI E-utilities in your NCBI account and enter it here.
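For scripts that query NCBI outside the Workbench from the same IP address, the request limit can be respected with a simple client-side throttle. The Python sketch below is only an illustration: the class name RequestThrottle is hypothetical, and it spaces requests evenly, which is a stricter policy than the limit itself requires.

```python
import time


class RequestThrottle:
    """Space requests so that no more than max_per_second are sent."""

    def __init__(self, max_per_second=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next request may be sent; return the delay applied."""
        now = self.clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self.sleep(delay)
        # Reserve the next slot one interval after this request's slot.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay


throttle = RequestThrottle(max_per_second=3)
for i in range(4):
    waited = throttle.wait()
    print(f"request {i}: waited {waited:.2f} s")
```

Calling wait() immediately before each request is enough; the clock and sleep arguments exist only so that the behavior can be checked without real delays.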
NCBI BLAST The standard URL for the BLAST server at NCBI is https://blast.ncbi.nlm.nih.gov/Blast.cgi,
but it is possible to specify an alternate server URL to use for BLAST searches. Be careful to
specify a valid URL, otherwise BLAST will not work.
Note: The "User Defined View Settings" option here refers only to information on which view
settings to set as the default for each view type. To export the view settings themselves, export
a .vsf file from the User Defined View Settings section under the View tab of Preferences, as
described in section 4.2.
Figure 4.9: Click on the View Settings button at the bottom of a Side Panel to apply new view
settings or to open dialogs for saving and managing view settings.
This section focuses on the functionality provided under the View Settings... menu for applying
and managing view settings. For general information about Side Panel settings, see section 2.1.6.
For view settings specific to tables, including column ordering, see section 9.
should be made available for other elements. In the latter case, you can specify if this group of
settings should be used as the default for this view, thereby affecting all elements with that view.
Figure 4.10: Click on the Save View Settings menu item (top) to open a dialog for saving the
settings. A name needs to be supplied for these settings. The settings can be made available only
for the data element being used or for all data elements of that type. Here, these settings have
been set as the default for all elements of this type (bottom).
View settings are user-specific. If your CLC Workbench is shared by multiple people, you will need
to export any custom view settings you wish them to have access to and they will need to import
them, as described in the Sharing view settings section below.
Figure 4.11: Select from saved view settings for the type of element open by clicking on the View
Settings button at the bottom of a Side Panel.
View settings named CLC Standard Settings are available for each data type. Until custom view
settings are saved and set as the default for a given data type, the CLC Standard Settings are
used.
Figure 4.12: In the Manage View Settings dialog, you can specify the default for that view, delete
saved settings, as well as export and import view settings.
To browse all custom view settings available in your CLC Workbench, open the View tab under
Preferences, as described in section 4.2.
Note: To export and import view settings for multiple view types, use the functionality under
Preferences, described in section 4.2.
Chapter 5
Printing
Contents
5.1 Selecting which part of the view to print
5.2 Page setup
5.3 Print preview
CLC Main Workbench offers several options for printing the results of your work.
This chapter deals with printing directly from CLC Main Workbench. Another way to use the
graphical output of your work is to export graphics (see section 8.2) in a graphic format, and
then import them into a document or a presentation.
All the kinds of data that you can view in the View Area can be printed. The CLC Main Workbench
uses a WYSIWYG principle: What You See Is What You Get. This means that you should use the
options in the Side Panel to change how your data, e.g. a sequence, looks on the screen. When
you print it, it will look the same in print as on the screen.
For some views, the layout is changed slightly to be printer-friendly.
It is not possible to print elements directly from the Navigation Area. They must first be opened
in a view in order to be printed. To print the contents of a view:
select relevant view | Print in the toolbar
This will show a print dialog (see figure 5.1).
In this dialog, you can:
These options are available for all views that can be zoomed in and out. Figure 5.2 shows a view
of a circular sequence that is zoomed in so that only part of it is visible.
When selecting Print visible area, your print will reflect the part of the sequence that is visible in
the view. The result from printing the view from figure 5.2 and choosing Print visible area can be
seen in figure 5.3.
On the other hand, if you select Print whole view, you will get a result that looks like figure 5.4.
This means that you also print the part of the sequence which is not visible when you have
zoomed in.
Figure 5.4: A print of the sequence selecting Print whole view. The whole sequence is shown, even
though the view is zoomed in on a part of the sequence.
• Orientation.
• Paper size. Adjust the size to match the paper in your printer.
• Fit to pages. Can be used to control how the graphics should be split across pages (see
figure 5.6 for an example).
    • Horizontal pages. If you set the value to e.g. 2, the printed content will be broken
    up horizontally and split across 2 pages. This is useful for sequences that are not
    wrapped.
    • Vertical pages. If you set the value to e.g. 2, the printed content will be broken up
    vertically and split across 2 pages.
Figure 5.6: An example where Fit to pages horizontally is set to 2, and Fit to pages vertically is set
to 3.
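The effect of the Fit to pages values can be pictured as cutting the printed content into a grid of equally sized tiles, one per page. The Python sketch below illustrates that idea only; the function name split_into_pages and the use of equal tiles are assumptions, not a description of how the Workbench lays out pages internally.

```python
def split_into_pages(width, height, horizontal_pages, vertical_pages):
    """Return one (left, top, right, bottom) rectangle per page, row by row."""
    tile_width = width / horizontal_pages
    tile_height = height / vertical_pages
    return [
        (col * tile_width, row * tile_height,
         (col + 1) * tile_width, (row + 1) * tile_height)
        for row in range(vertical_pages)
        for col in range(horizontal_pages)
    ]


# Content of 1200 x 900 units, 2 pages across and 3 pages down.
pages = split_into_pages(1200, 900, horizontal_pages=2, vertical_pages=3)
print(len(pages), "pages")
for page in pages:
    print(page)
```

With 2 horizontal pages and 3 vertical pages, as in figure 5.6, the content is spread over 6 pages in total.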
Note! It is a good idea to consider adjusting view settings (e.g. Wrap for sequences) in the
Side Panel before printing. As explained at the beginning of this chapter, the printed material will
look like the view on the screen, and therefore these settings should also be considered when
adjusting Page Setup.
Header and footer Click the Header/Footer tab to edit the header and footer text. By clicking
in the text field for either Custom header text or Custom footer text, you can access the auto
formats for header/footer text under Insert a caret position. Click either Date, View name, or User
name to include the auto format in the header/footer text.
Click OK when you have adjusted the Page Setup. The settings are saved so that you do not
have to adjust them again next time you print. You can also change the Page Setup from the File
menu.
The Print preview window lets you see the layout of the pages that are printed. Use the arrows
in the toolbar to navigate between the pages. Click Print to show the print dialog, which lets
you choose e.g. which pages to print.
The Print preview window is for preview only. The layout of the pages must be adjusted in the
Page Setup.
Chapter 6
Connections to other systems
Contents
6.1 CLC Server connection
6.1.1 CLC Server data import and export
6.2 AWS Connections
Under the Connections menu are tools for connecting the CLC Main Workbench to other systems.
• Data in CLC Server locations will be listed in the Workbench Navigation Area.
• When launching analyses that can be run on the CLC Server, you will be offered the choice
of running them using the Workbench or the CLC Server.
• External applications configured and enabled on the CLC Server will be available to launch
and to include in workflows.
CHAPTER 6. CONNECTIONS TO OTHER SYSTEMS 108
Figure 6.1: Logging into a CLC Server using the default port, 7777
Your username and the server details are saved between Workbench sessions. To save your
password also, check the Remember password box.
To connect to the CLC Server automatically when the CLC Workbench starts up, check the Log
into CLC Server at Workbench startup box. This option is only available when the Remember
password option has been enabled.
Figure 6.2: Hover the mouse cursor over the Server icon in the bottom left corner of the Workbench
frame to quickly view the status of the server connection.
• Monitoring processes sent to the CLC Server from a CLC Workbench: section 2.4
• Viewing and working with data held on a CLC Server: section 3.1
• Importing data to and exporting data from a CLC Server: section 6.1.1
For those logging into the CLC Server as a user with administrative privileges, an option called
Manage Server Users and Groups... will be available. This is described at
https://resources.qiagenbioinformatics.com/manuals/clcserver/current/admin/index.php?manual=User_authentication_via_Workbench_built_in_authentication.html.
Figure 6.3: A warning is shown when the certificate is not signed by a recognized CA.
The certificate details can be viewed again later by clicking on the SSL Certificate button in the
CLC Server Connection dialog.
The tooltip shown when hovering over the Server icon in the bottom left corner of the Workbench
frame provides connection status information, including whether the connection is encrypted
(figure 6.4).
Figure 6.4: Login details and connection status information for an unencrypted connection to a
CLC Server (left) and an encrypted connection (right). A padlock on the server icon in the bottom
left corner of the Workbench frame also indicates the connection is encrypted.
Figure 6.5: When an import is run on a CLC Server, the list of locations that data can be imported
from reflects the server configuration.
Note that when importing data from an AWS S3 bucket, the data is first downloaded from AWS,
which AWS charges for.
Data in Workbench or Server CLC File System Locations can be selected for export.
Exported files can be saved to areas the CLC Main Workbench has access to, including
AWS S3 buckets if an AWS S3 Connection has been configured in the CLC Main
Workbench.
• Running the export on the CLC Server or Grid via CLC Server:
Data from Server File System Locations can be selected for export.
Exported files can be saved to Server import/export directories or to an AWS S3 bucket
if an AWS Connection has been configured in the CLC Server.
• Submitting analyses to a CLC Genomics Cloud setup, if available on that AWS account.
Configuring access to your AWS accounts requires AWS IAM credentials. Configuring access to
public S3 buckets requires only the name of the bucket.
Working with stored data in AWS S3 buckets via the Workbench is of particular relevance when
submitting jobs to run on a CLC Genomics Cloud setup making use of functionality provided by
the CLC Cloud Module.
When launching workflows to run locally using on-the-fly import and selecting files from AWS S3,
the files selected are first downloaded to a temporary folder and are subsequently imported.
All traffic to and from AWS is encrypted using a minimum of TLS version 1.2.
Figure 6.6: The configuration dialog for AWS connections. Here, two valid AWS connections, their
status, and a public S3 bucket are listed.
• Connection name: A short name of your choice, identifying the AWS account. This name
will be shown as the name of the data location when importing data to or exporting data
from Amazon S3.
• AWS access key ID: The access key ID for programmatic access for your AWS IAM user.
• AWS secret access key: The secret access key for programmatic access for your AWS IAM
user.
The dialog continuously validates the settings entered. When they are valid, the Status box will
contain the text "Valid" and a green icon will be shown. Click on OK to save the settings.
AWS connection status is indicated using colors. Green indicates the connection is valid and
ready for use. Connections to a CLC Genomics Cloud are indicated in the CGC column (figure 6.6).
To submit analyses to the CLC Genomics Cloud, the CLC Cloud Module must be installed and a
license for that module must be available.
AWS credentials entered are stored, obfuscated, in Workbench user configuration files.
Note: Multiple AWS Connections using credentials for the same AWS account cannot be
configured.
Figure 6.8: Provide a public AWS S3 bucket name to enable access to data in that public bucket.
Figure 6.9: After an AWS connection is selected when exporting, you can select the S3 bucket and
location within that bucket to export to.
Chapter 7
Importing data
Contents
7.1 Standard import
Many data formats are supported for import into the CLC Main Workbench. Data types that are not
recognized are imported as "external files". Such files are opened in the default application for
that file type on your computer (e.g. Word documents will open in Word). This chapter describes
import of data, with a focus on import of common bioinformatics data formats.
The default option is Automatic import. The file format is automatically detected based on a
combination of the file extension (e.g. .fa for fasta) and detection of file contents specific to
particular formats. Based on this, the relevant importer is run. The particular importer used is
recorded in the element history. If the format is not supported, the file is imported as an external
file, that is, it is imported to the CLC Location in its original format. See section 3.2 for details
about working with such files.
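The idea behind Automatic import, combining a file extension hint with a check of the file contents and falling back to an external file, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the extension table, the content checks and the function name detect_format are all hypothetical and much simpler than the importers in the Workbench.

```python
from pathlib import Path

# Hypothetical extension hints; the real list of supported formats is far longer.
EXTENSION_HINTS = {".fa": "fasta", ".fasta": "fasta", ".gb": "genbank", ".gbk": "genbank"}

# Hypothetical content checks, applied to the first non-empty line of the file.
CONTENT_CHECKS = {
    "fasta": lambda first_line: first_line.startswith(">"),
    "genbank": lambda first_line: first_line.startswith("LOCUS"),
}


def detect_format(filename: str, text: str) -> str:
    """Pick a format using the extension as a hint and the contents as confirmation."""
    first_line = next((line for line in text.splitlines() if line.strip()), "")
    hinted = EXTENSION_HINTS.get(Path(filename).suffix.lower())
    # Try the format suggested by the extension first, then every other known format.
    candidates = ([hinted] if hinted else []) + [f for f in CONTENT_CHECKS if f != hinted]
    for fmt in candidates:
        if CONTENT_CHECKS[fmt](first_line):
            return fmt
    return "external file"


print(detect_format("seqs.fa", ">seq1\nACGT\n"))
print(detect_format("notes.txt", "LOCUS       AB000001\n"))
print(detect_format("report.docx", "Quarterly summary\n"))
```

In this sketch a file whose contents match no known format is treated as an external file, mirroring the fallback described above.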
You can specify the format explicitly using the option Force import as type. When choosing this
option, the full list of supported data types is provided in a drop-down list.
The Force import as external file option can be useful if you are trying to import a standard
format file, such as a text file, but it is being detected as a bioinformatics format file, such as
sequence data.
Standard Import is also used to import files that are dragged from a file browser and dropped
into the Navigation Area. In this case, the file format is automatically detected. To force the file
type, launch the tool explicitly, as described at the start of this section.
Contents
8.1 Data export
8.1.1 Export formats
8.1.2 Export parameters
8.1.3 Specifying the exported file name(s)
8.1.4 Export of folders and data elements in CLC format
8.1.5 Export of dependent elements
8.1.6 Export of tables
8.1.7 GFF3 export
8.1.8 JSON export
8.1.9 Graphics export
8.1.10 Export history
8.2 Export graphics to files
8.2.1 File formats
8.3 Export graph data points to a file
8.4 Copying and pasting data from an open view
Data and graphics can be exported from the CLC Main Workbench using export tools and
workflows that contain export elements, as well as using functionality in some right-click menus.
Data can be exported to any location accessible via the system the CLC Main Workbench is
installed on, or to AWS S3 if an AWS Connection is configured that allows this (see section 6.2).
If connected to a CLC Server, data can be exported to areas accessible via the CLC Server
(see section 6.1.1).
CHAPTER 8. EXPORTING DATA AND GRAPHICS 118
• Configuring the export parameters (section 8.1.2) including the filename to export to
(section 8.1.3).
When a single data element is selected in the Navigation Area, the export option Export with
Dependent Elements is enabled under the File menu. Using this exporter, the selected element
and any data elements used to create it are exported in CLC format to a single zip file. For
further details, see section 8.1.5.
• If data elements are selected in the Navigation Area before launching the Export tool, then
a "Yes" or a "No" in the Supported format column specifies whether or not the selected
data elements can be exported to that format. If you have selected multiple data elements
of different types, then formats that some, but not all, selected data elements can be
exported to are indicated by the text "For some elements".
• If no data elements are selected in the Navigation Area when the Export tool is launched,
then the list of export formats is provided, but each row will have a "Yes" in the Supported
format column. After an export format has been selected, only the data elements that can
be exported to that format will be listed for selection in the next step of the export process.
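The rules for the text shown in the Supported format column can be summarized in a few lines of Python. This is a sketch of the logic described above, not the Workbench's implementation; the function name supported_label and the element type names are hypothetical.

```python
def supported_label(selected_types, exportable_types):
    """Return the text shown for one export format.

    selected_types: types of the data elements selected before launching Export.
    exportable_types: types that this export format can handle.
    """
    if not selected_types:
        return "Yes"  # nothing selected: every format is offered
    supported = [element_type in exportable_types for element_type in selected_types]
    if all(supported):
        return "Yes"
    if any(supported):
        return "For some elements"
    return "No"


vcf_accepts = {"variant track"}
print(supported_label(["variant track"], vcf_accepts))
print(supported_label(["variant track", "sequence list"], vcf_accepts))
print(supported_label(["sequence list"], vcf_accepts))
print(supported_label([], vcf_accepts))
```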
Only zip format is supported when a folder, rather than data elements, is selected for
export. In this case, all the elements in the folder are exported in CLC format, and a zip file
containing these is created. See section 8.1.4.
When the desired export format has been selected, click on the button labeled Select.
Figure 8.1: The Select export format dialog. Here, some sequence lists had been selected in the
Navigation Area before the Export tool was launched. The formats that the selected data elements
can be exported to contain a "Yes" in the Supported format column. Other export formats are listed
below the supported ones, with "No" in the Supported format column.
Figure 8.2: The text field has been used to search for the term "VCF" in the export format name or
description field in the Select export dialog.
A dialog then appears, with a name reflecting the format you have chosen. For example if the
VCF format was selected, the window is labeled "Export VCF".
If you are logged into a CLC Server, you will be asked whether to run the export job using the
Workbench or the Server. After this, you are provided with the opportunity to select or de-select
data to be exported.
Selecting data for export In figure 8.3 we show the selection of a variant track for export to VCF
format.
Further information is available about exporting the following types of information:
Figure 8.3: The Select export dialog. Select the data element(s) to export.
Figure 8.4: Configure the export parameters. When exporting to CLC format, you can choose to
maximize compatibility with older CLC products.
• Maximize compatibility with older CLC products This is described in section 8.1.4.
• Compression options Within the Basic export parameters section, you can choose to
compress the exported files. The options are no compression (None), gzip or zip format.
Choosing zip format results in all data files being compressed into a single file. Choosing
gzip compresses the exported file for each data element individually.
• Paired reads settings In the case of Fastq Export, the option "Export paired sequence lists
to two files" is selected by default: it will export paired-end reads to two fastq files rather
than a single interleaved file.
• Exporting multiple files If you have selected multiple files of the same type, you can choose
to export them in one single file (only for certain file formats) by selecting "Output as single
file" in the Basic export parameters section. If you wish to keep the files separate after
export, make sure this box is not ticked. Note: Exporting in zip format will export only one
zipped file, but the files will be separated again when unzipped.
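The difference between the gzip and zip compression options can be illustrated with Python's standard library. This is a sketch for illustration only; the function names and the archive name export.zip are assumptions, and the files produced by the Workbench may be named differently.

```python
import gzip
import io
import zipfile


def compress_gzip(files):
    """gzip: each exported file is compressed individually."""
    return {name + ".gz": gzip.compress(data) for name, data in files.items()}


def compress_zip(files, archive_name="export.zip"):
    """zip: all exported files are compressed into a single archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return {archive_name: buffer.getvalue()}


exported = {"sample1.fa": b">s1\nACGT\n", "sample2.fa": b">s2\nTTGA\n"}
print(sorted(compress_gzip(exported)))  # one .gz file per input
print(sorted(compress_zip(exported)))   # a single .zip file
```

Unzipping the single archive yields the individual files again, which matches the note above about zipped exports.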
The name to give exported files is also configured here. This is described in detail in section 8.1.3.
In the final wizard step, you select the location to save the exported files to.
Figure 8.5: The default placeholders, separated by a ".", are being used here. The tooltip for the
Custom file name field provides information about these and other available placeholders.
• {counter} - a number that is incremented per file exported. For example, if you export more than
one file, {counter} is replaced with 1 for the first file, 2 for the next, and so on.
• {year}, {month}, {day}, {hour}, {minute}, and {second} - timestamp information based on
the time an output is created. Using these placeholders, items generated by a tool at
different times can have different filenames.
Note: Placeholders available for Workflow Export elements are different and are described
in section 13.2.4.
Exported files can be saved into subfolders by using a forward slash character / at the start of the
custom file name definition. When defining subfolders, all later forward slash characters in the
configuration, except the last one, are interpreted as further levels of subfolders. For example,
a name like /outputseqs/level2/myoutput.fa would put a file called myoutput.fa into
a folder called level2 within a folder called outputseqs, which would be placed within the
output folder selected in the final wizard step when launching the export tool. If the folders
specified in the configuration do not already exist, they are created. Folder names can also be
specified using placeholders.
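How a custom file name containing placeholders and subfolders resolves to a path can be sketched in Python. The sketch covers only the {counter} and timestamp placeholders named above; the function names are hypothetical, zero-padded timestamp fields are an assumption, and no folders are actually created.

```python
from datetime import datetime
from pathlib import PurePosixPath


def expand_placeholders(pattern: str, counter: int, created: datetime) -> str:
    """Replace {counter} and the timestamp placeholders with concrete values."""
    values = {
        "counter": str(counter),
        # Zero-padded fields are an assumption of this sketch.
        "year": f"{created.year:04d}",
        "month": f"{created.month:02d}",
        "day": f"{created.day:02d}",
        "hour": f"{created.hour:02d}",
        "minute": f"{created.minute:02d}",
        "second": f"{created.second:02d}",
    }
    for key, value in values.items():
        pattern = pattern.replace("{" + key + "}", value)
    return pattern


def resolve_export_path(output_folder, pattern, counter, created):
    """Combine the selected output folder with the expanded custom file name."""
    name = expand_placeholders(pattern, counter, created)
    base = PurePosixPath(output_folder)
    if name.startswith("/"):
        # Leading slash: all segments but the last are subfolders of the output folder.
        return base.joinpath(*name.strip("/").split("/"))
    return base / name


when = datetime(2025, 1, 2, 3, 4, 5)
print(resolve_export_path("/exports", "/outputseqs/level2/myoutput.fa", 1, when))
print(resolve_export_path("/exports", "/run_{year}{month}{day}/seqs_{counter}.fa", 2, when))
```

The second call shows placeholders used in a folder name, so exports made on different days would be placed in different subfolders.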
Figure 8.6: The file name extension can be changed by typing in the preferred file name format.
If a folder is selected for export, only the zip format is supported. In this case, each data element
in that folder will be exported to CLC format, and all these files will be compressed in a single zip
file.
CLC format files, or zip files containing CLC format data, can be imported directly into a workbench
using the Standard Import tool and selecting "Automatic import" in the Options area.
• A new compression method was introduced with version 22.0 of the CLC Genomics
Workbench, CLC Main Workbench and CLC Genomics Server. Compressed data created
using those versions can be read by version 21.0.5 and above, but not earlier versions.
• Internal compression of CLC data was introduced in CLC Genomics Workbench 12.0, CLC
Main Workbench 8.1 and CLC Genomics Server 11.0. Compressed data created using
these versions is not compatible with older versions of the software. Data created using
these versions can be opened by later versions of the software, including versions 22.0
and above.
Information on how to turn off internal data compression entirely is provided in section 4.4. We
generally recommend, however, that data compression remains enabled.
The exported file will contain compressed CLC format files for the parent data element and its
dependent data elements.
A zip file created this way can be imported by going to:
File | Import | Standard Import
and selecting "Automatic import" in the Options area.
Figure 8.7: The Export tool has been started and data with tabular content selected as input.
Relevant export formats for this data type are indicated by a "Yes" in the Supported format column.
Details of what will be exported for each option are provided in the Description column.
Figure 8.8: The tabular content of any data type with a table view can be exported. Here, Excel
2010 was selected as the format to export to. The option to export all columns is selected, so the
full table (all columns and rows) will be exported.
• Default Selects the columns defined as standard in the software for the data type.
• Last export Selects the same columns that were selected for the most recent, previous
export.
• Active View Selects the same set of columns as those selected in the Side Panel of the
open data element. This button is visible only if the element being exported is open in
the viewing area when the export tool is launched. See below for an additional method to
export the table content visible in an open view.
Figure 8.9: The Export all columns option was unchecked in the previous wizard step, allowing the
columns to export to be specified in the wizard step shown. The table being exported was open in
a view, so the Export table as currently shown option is available, and the Active View button is
available.
1. Right-click in the table area (figure 8.10), and choose the menu option:
File | Export Table
or
2. Run the standard Export tool (section 8.1) and choose a format relevant for exporting tabular
content. Uncheck the Export all columns option in the launch wizard and, in the subsequent
wizard step, check the option Export table as currently shown (figure 8.11).
Figure 8.10: Right-click on a table in the viewing area and select Export Table... from under the
File menu to export just the columns and rows displayed in the view.
Figure 8.11: A data element open in the viewing area was selected as input for the Export tool and
the Export all columns option was then unchecked. The Export table as currently shown option is
selected, so just the columns and rows shown in the table view of the open data element will be
exported.
Selections in tables can also be copied, and then pasted into third party applications, as
described in section 8.4.
• Row limits Excel limits the number of hyperlinks in a worksheet to 65,530. When exporting
a table with more rows than this, Excel will "repair" the file by removing all hyperlinks. To
keep the hyperlinks valid, subset your data and export it to several worksheets, each with
fewer than 65,530 rows.
• Decimal places When exporting to CSV, tab-separated, or Excel formats, numbers with
many decimals are exported with 10 decimal places, or in scientific notation (e.g. 1.123E-5)
when the number is close to zero.
When exporting a table in HTML format, data are exported with the number of decimals
that have been defined in the CLC Main Workbench preference settings. When tables are
exported in HTML format from a CLC Server the default number of decimal places is 3.
• Decimal notation When exporting to CSV and tab delimited files, decimal numbers are
formatted according to the Locale setting of the CLC Main Workbench (see General
preferences, section 4.1). If you open the CSV or tab delimited file with software like Excel,
that software and the CLC Workbench should be configured with the same Locale.
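Subsetting a large table so that each worksheet stays below the hyperlink limit mentioned under Row limits amounts to splitting the rows into consecutive chunks. The Python sketch below shows the splitting step only; the function name split_into_sheets is hypothetical, max_rows should be set to a value below the limit, and writing the chunks to a spreadsheet is left to whichever tool you use.

```python
def split_into_sheets(rows, max_rows):
    """Split rows into consecutive chunks of at most max_rows rows each."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    return [rows[start:start + max_rows] for start in range(0, len(rows), max_rows)]


# Small numbers keep the example readable; real tables are much larger.
table_rows = [f"row {i}" for i in range(10)]
for number, sheet in enumerate(split_into_sheets(table_rows, max_rows=4), start=1):
    print(f"sheet {number}: {len(sheet)} rows")
```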
Selection to GFF3 File from the menu (figure 14.20). Further information about sequence
annotations is in section 14.3.
• header. Contains information about the version of the JSON exporter and front page
elements included in the report (the front page elements are visible in the PDF export of
the report).
• data. Contains the actual data found in the report (sections, subsections, figures, tables,
text).
• metadata. Contains information about the metadata files that the report references.
• history. Contains information about the history of the report (as seen in the "Show history"
view).
The data section contains nested elements following the structure of the report:
• The keys of sections (and subsections, etc) are formed from the section (and subsection,
etc) title, with special characters replaced. For example, the section "Counted fragments by
type (total)" is exported to an element with the key "counted_fragments_by_type_total".
• A section is made of the section title, the section number, and all other elements that are
nested in it (e.g., other subsections, figures, tables, text).
• Figures, tables and text are exported to elements with keys "figure_n", "table_n" and
"text_n", n being the number of the elements of that type in the report.
• Figures contain information about the titles of the figure, x axis, and y axis, as well as
legend and data. This data is originally available in the Workbench by double clicking on a
figure in a report and using the "Show Table" view.
• The names of table columns are transformed to keys in a similar way to section titles.
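When post-processing exported reports, it can help to predict the key that a given title will have. The Python sketch below approximates the transformation described above; the exact replacement rules used by the exporter are not documented here, so treat the function title_to_key as an assumption and confirm keys against an actual exported file.

```python
import re


def title_to_key(title: str) -> str:
    """Approximate the key for a title: lowercase, with runs of other characters as '_'."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


for title in ("Read quality control", "Paired distance"):
    print(title, "->", title_to_key(title))
```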
Once exported, the JSON file can be parsed and further processed. For example, using R and
the package jsonlite, reports from different samples can be jointly analyzed. This enables easy
comparison of any information present in the original reports across samples.
library(jsonlite)
library(tools)
library(ggplot2)
The script relies on the following functions to extract the data from the parsed JSON files.
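#' Get the paired distance from a parsed report. Returns NULL if the reads were
#' unpaired.
get_paired_distance <- function(parsed_report) {
  section <- parsed_report$data$read_quality_control
  if (!("paired_distance" %in% names(section))) {
    return(NULL)
  } else {
    figure <- section$paired_distance$figure_1
    # `report` is not an argument of this function. It is looked up in the
    # environment where the function is defined (normally the global environment)
    # and must hold the path of the JSON file being processed.
    return(data.frame(sample = basename(file_path_sans_ext(report)),
                      figure$data))
  }
}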
#' Get the figure, x axis, and y axis titles from the paired distance figure
#' from a parsed report. Returns null if the reads were unpaired.
#' Re-order the intervals for the paired distances by using the starting value of the interval.
order_paired_distances <- function(paired_distance) {
  distances <- unique(paired_distance$distance)
  starting <- as.numeric(sapply(strsplit(distances, split = " - "), function(l) l[1]))
  distances <- distances[sort.int(starting, index.return = TRUE)$ix]
  paired_distance$distance <- factor(paired_distance$distance, levels = distances)
  # calculate the breaks used on the x axis for the paired distances
  breaks <- distances[round(seq(from = 1, to = length(distances), length.out = 15))]
  return(list(data = paired_distance, breaks = breaks))
}
Using the above functions, the script below parses all the JSON reports found in the "exported
reports" folder, to build a read count statistics table (read_count_statistics), and a paired
distance histogram.
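As a minimal sketch of such a loop, the R code below writes two tiny stand-in reports to a temporary folder, parses each one with jsonlite, and stacks the paired distance data into a single data frame. The key path (data, read_quality_control, paired_distance, figure_1) follows the functions above. The folder name, the sample names, the "count" column and all values are made up for illustration and are assumptions about the report layout.

```r
library(jsonlite)
library(tools)

# Create a fresh temporary folder holding two stand-in reports (made-up values).
report_dir <- tempfile("exported_reports")
dir.create(report_dir)
make_report <- function(counts) {
  list(data = list(read_quality_control = list(paired_distance = list(
    figure_1 = list(data = data.frame(distance = c("0 - 9", "10 - 19"),
                                      count = counts))))))
}
write_json(make_report(c(3, 7)), file.path(report_dir, "sample_A.json"), auto_unbox = TRUE)
write_json(make_report(c(5, 1)), file.path(report_dir, "sample_B.json"), auto_unbox = TRUE)

# Parse every JSON report in the folder and stack the paired distance data.
reports <- list.files(report_dir, pattern = "\\.json$", full.names = TRUE)
paired_distance <- do.call(rbind, lapply(reports, function(report) {
  parsed <- fromJSON(report)
  figure <- parsed$data$read_quality_control$paired_distance$figure_1
  if (is.null(figure)) return(NULL)  # unpaired reads: nothing to add
  data.frame(sample = basename(file_path_sans_ext(report)), figure$data)
}))

print(paired_distance)
```

Each row of the combined data frame carries the sample name derived from the file name, so the result can be passed directly to a plotting function such as ggplot for comparison across samples.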
• You can export the current view, either the visible area or the entire view, by clicking on
the Graphics button in the top Toolbar. This is the generally recommended route for
exporting graphics for individual data elements, and is described in section 8.2.
• For some data types, graphics export tools are available from the main Export menu, which
can be opened by clicking on the Export button in the top Toolbar. These are useful if
you wish to export different data using the same view in an automated fashion, for example
by running the export tool in batch mode or in a workflow context. This functionality is
described below.
• Alignments
• Heat maps
• Read mappings
• Sequences
• Tracks
• Track lists
• Click on the Export button in the top Toolbar or choose the Export option under the
File menu.
• Type "graphics" in the top field to see just a list of graphics exporters, and then select the
one you wish to use. For example, if you wish to export an alignment as graphics, select
"Alignment graphics" in the list.
• Configure any relevant options. Detailed descriptions of these are provided below.
Options available when exporting sequences, alignments and read mappings to graphics format
files are shown in figure 8.12.
The options available when exporting tracks and track lists to graphics format files are shown in
figure 8.13.
The format and size of the exported graphics can be configured using:
• Graphics format: Several export formats are available, including bitmap formats (such as
.png, .jpg) and vector graphics (.svg, .ps, .eps).
• Width and height: The desired width and height of the exported image. This can be
specified in centimeters or inches.
• Resolution: The resolution, specified in dpi (dots per inch).
Figure 8.12: Options available when exporting sequences, alignments and read mappings to
graphics format files.
Figure 8.13: Options available when exporting tracks and track lists to graphics format files.
• View settings: The view settings available for the data type being exported. To determine
how the data will look when a particular view is used, open a data element of the type you
wish to export, click on the Save View button visible at the bottom of the Side Panel, and
apply the view settings in the dialog that appears. View settings are described in section
4.6. Custom view settings will be available to choose from when exporting if the "Save for
all <data type> views" option was checked when the view was saved.
• Region restriction: The region to be exported. For sequences, alignments and read
mappings, the region is specified using start and end coordinates. For tracks and track
lists, you provide an annotation track, where the region corresponding to the full span of
the first annotation is exported. The rest of the annotations in the track have no effect.
Figure 8.14: Select "History PDF" for exporting the history of an element as a PDF file.
Figure 8.15: When exporting the history in PDF, it is possible to adjust the page setup.
Figure 8.16: An example of the top of the exported PDF containing the history of an element
generated using an installed workflow.
• You can export the current view, either the visible area or the entire view, by clicking on
the Graphics button in the top Toolbar. This is the generally recommended route for
exporting graphics for individual data elements, and is described below.
• For some data types, graphics export tools are available in the main Export menu, which
can be opened by clicking on the Export button in the top Toolbar. These are useful if
you wish to export different data using the same view in an automated fashion, for example
by running the export tool in batch mode or in a workflow context. That functionality is
described in section 8.1.9.
Figure 8.17: The whole view or just the visible area can be selected for export.
Figure 8.18: A circular sequence, as it looks on the screen when zoomed in.
Figure 8.19: The exported graphics file when Export visible area was selected.
Figure 8.20: The exported graphics file when Export whole view was selected. The whole sequence
is shown, not just the part visible on screen when the view was exported.
Bitmap images In a bitmap image, each dot in the image has a specified color. This implies
that if you zoom in on the image there will not be enough dots, and if you zoom out there will be
too many. In these cases the image viewer has to interpolate the colors to fit what is actually
being viewed. A bitmap image needs to have a high resolution if you want to zoom in. This format
is a good choice for storing images without large shapes (e.g. dot plots). It is also appropriate
if you do not need to resize or edit the image after export.
To produce a high resolution image with all the details of a large element visible, e.g. a large
phylogenetic tree or a read mapping, we recommend exporting to a vector based format.
If Screen resolution and High resolution settings show the same pixel dimensions, this can be
because the maximum supported number of pixels has been exceeded.
Parameters for bitmap formats For bitmap files, clicking Next will display the dialog shown in
figure 8.21.
Figure 8.21: Parameters for bitmap formats: size of the graphics file.
You can adjust the size (the resolution) of the file to four standard sizes:
• Screen resolution
• Low resolution
• Medium resolution
• High resolution
The actual size in pixels is displayed in parentheses. An estimate of the memory usage for
exporting the file is also shown. If the image is to be used on computer screens only, a low
resolution is sufficient. If the image is going to be used on printed material, a higher resolution
is necessary to produce a good result.
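To judge whether a given pixel size is enough for print, it can help to work out how many pixels a printed image needs. The R sketch below uses the general relation pixels = inches × dpi, with 2.54 cm per inch. The function name and the example dimensions are illustrative only and are not Workbench settings.

```r
# Number of pixels needed to print an image of the given physical size
# at the given resolution (dots per inch).
pixels_for_print <- function(width_cm, height_cm, dpi) {
  cm_per_inch <- 2.54
  round(c(width = width_cm, height = height_cm) / cm_per_inch * dpi)
}

# A 10.16 cm x 5.08 cm (4 in x 2 in) figure printed at 300 dpi.
print_size <- pixels_for_print(10.16, 5.08, 300)
print(print_size)
```

Comparing the result with the pixel dimensions shown in parentheses in the dialog indicates which of the standard sizes is large enough for the intended use.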
Vector graphics A vector graphic is a collection of shapes. Thus what is stored is information
about where a line starts and ends, and the color of the line and its width. This enables a given
viewer to decide how to draw the line, no matter what the zoom factor is, thereby always giving
a correct image. This format is good for graphs and reports, but less usable for dot plots. If the
image is to be resized or edited, vector graphics are by far the best format to store graphics. If
you open a vector graphics file in an application such as Adobe Illustrator, you will be able to
manipulate the image in great detail.
Graphics files can also be imported into the Navigation Area. However, graphics files cannot be
displayed in CLC Main Workbench. See section 3.2 for more about importing external files
into CLC Main Workbench.
Parameters for vector formats For PDF format, the dialog shown in figure 8.22 will sometimes
appear after you have clicked Finish (for example, when the graphics use more than one page,
or there is more than one PDF to export).
The settings for the page setup are shown. Clicking the Page Setup button will display a dialog
where these settings can be adjusted. This dialog is described in section 5.2.
You can then select the option "Apply these settings for subsequent reports in this export"
to apply the chosen settings to, for example, all the PDFs included in the export.
The page setup is only available if you have selected to export the whole view - if you have chosen
to export the visible area only, the graphics file will be on one page with no headers or footers.
Exporting protein reports It is possible to export a protein report using the normal Export
function, which will generate a PDF file with a table of contents:
Click the report in the Navigation Area | Export in the Toolbar | select PDF
You can also choose to export a protein report using the Export graphics function, but
the exported file will then not include the table of contents.
Figure 8.23: A conservation graph displayed along mapped reads. Right-click the graph to export
the data points to a file.
will be shown: If the graph is covering a set of aligned sequences with a main sequence, such
as read mappings and BLAST results, the dialog shown in figure 8.24 will be displayed. These
kinds of graphs are located under Alignment info in the Side Panel. In all other cases, a normal
file dialog will be shown letting you specify name and location for the file.
In this dialog, select whether you wish to include positions where the main sequence (the
reference sequence for read mappings and the query sequence for BLAST results) has gaps.
If you are exporting, for example, coverage information from a read mapping, you would probably
want to exclude gaps so that the positions in the exported file match the reference (i.e.
chromosome) coordinates. If you export including gaps, the data points in the file no longer
correspond to the reference coordinates, because each gap shifts the coordinates.
Clicking Next will present a file dialog letting you specify name and location for the file.
The output format of the file is like this:
"Position";"Value";
"1";"13";
"2";"16";
"3";"23";
"4";"17";
...
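A file in this format can be read into R for further processing. The sketch below shows one way to do this, assuming the layout shown above (semicolon-separated, quoted values, and a trailing semicolon on every line). The file is a temporary stand-in created by the example itself.

```r
# Write a small stand-in file using the layout shown above.
path <- tempfile(fileext = ".txt")
writeLines(c('"Position";"Value";',
             '"1";"13";',
             '"2";"16";',
             '"3";"23";',
             '"4";"17";'), path)

# Drop the trailing semicolon so that no empty third column is created,
# then parse the remaining text as a semicolon-separated table.
lines <- sub(";$", "", readLines(path))
graph_data <- read.table(text = lines, sep = ";", header = TRUE, quote = "\"")

# The quoted numbers are converted to numeric columns on import.
str(graph_data)
```

Removing the trailing separator before parsing is a design choice made here for clarity; reading the file as-is and discarding the resulting empty column would work as well.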
That copied information can then be pasted from the clipboard into other programs.
Copying tabular content Tabular content can be copied directly from an open view and pasted
into text editors or programs designed for tabular data like Excel.
Tabular data is available in table views of data elements, or the Contents view of a folder, which
can be opened by right-clicking a folder in the Navigation Area and choosing the menu option
Show | Content (figure 8.25).
Figure 8.25: The Contents view of a folder with some rows selected.
See section 8.1.6 for detailed information about exporting tabular data from the CLC Main
Workbench.
Copying images of workflows Contents of a Workflow view can be copied and pasted into other
software as an image. Only selected elements and connectors are copied.
To paste an image of an entire workflow into another application, first select all the contents of
the workflow using the keyboard shortcut Ctrl + A (⌘ + Shift + A on Mac), and then copy them.
See the Workflows chapter (chapter 13) for details about designing and running workflows.
Chapter 9
Contents
9.1 Table view settings and column ordering
9.2 Filtering tables
General features relevant to many table types are described in this section. For functionality
associated with specific table types, please refer to the manual section describing that particular
data type.
Key functionality available for tables includes:
• Sorting A table can be sorted according to the values of a particular column by clicking a
column header. Clicking once will sort in ascending order. A second click will change the
order to descending. A third click will set the order back to its original order.
Pressing Ctrl (⌘ on Mac) while you click other columns will refine the existing sorting with
the values of the additional columns, in the order in which you clicked them.
• Configuring the view This includes specifying which columns should be visible and defining
the column order (see section 9.1). View settings can be saved for later use, with a specific
table or for any similar table (see section 4.6).
• Displaying only the selected rows Click on the Filter to Selection... button above a table
to update the view to show only the selected rows.
Rows can be selected manually, or by using the "Select in other views" option, which is
available for some tables, generally those with an associated graphical view such as a
Venn diagram, or a volcano plot.
To view the full table again, click on the Filter to Selection... button and choose the
option Clear selection filter.
• Displaying only rows with content of interest Tables can be interactively filtered using
simple or complex search criteria such that only rows containing content of interest are
shown. Sets of table filters can be saved for re-use. See section 9.2 for details.
Scroll bars appear at the bottom and at the right of a table when table contents exceed the size
of the viewing area.
CHAPTER 9. WORKING WITH TABLES 142
• File |Export Table Export the table to CSV, TSV, HTML or Excel format. Filtering, sorting,
column selection and column order are respected when exporting the table this way.
• Edit | Copy Cell Right-click on a cell and choose this option to copy the contents of that cell
to the clipboard.
The option called Table filters, also available in the right-click menu, is explained in
section 9.2.
• If saved view settings are applied to a table that contains columns not defined in those
view settings, those columns will be placed at the far right of the table.
• Saved view settings referring to columns not present in the table that they are being applied
to are ignored.
• Automatic Columns are sized to fit the width of the viewing area.
Figure 9.1: A table with all but one available columns visible, and the "Start codon" column moved
to the start of the table from its original location, which was at the end of the table.
2. Move the column to the desired location in the Show columns palette in the Side Panel.
Hover over the column name in the Side Panel to reveal an icon, then press and hold the
mouse button and drag the column to the desired position.
The order of the columns in the viewing area is updated automatically.
3. Apply saved view settings where a relevant column order has been defined. See section
4.6 for details about applying saved view settings.
When a table open for viewing is exported to a file, such as a .csv file, this custom column
order can be used. See section 8.1.6 for details.
Simple filtering
The default view of a table supports simple filtering, where a search term is entered into a
field to the left of the Filter button and only rows containing that term are shown (figure 9.2).
Simple filtering is enabled when there is an upwards pointing arrow at the top right of the table
view. The keyboard shortcut Ctrl + F (Mac: ⌘ + F) moves the cursor into the simple filter field.
(Clicking on the arrow beside that field reveals advanced filtering options, which are described
later in this section.)
Simple filtering starts automatically, as you type, unless the table has more than 10,000 rows.
In that case, click on the Filter button after typing the term to filter for.
The number of rows with a match to the term is reported in the top left of the table.
The following characters have special meanings when used in the simple filtering field:
• Space Terms separated by spaces are treated as individual search terms unless the terms
are placed within quotes. E.g. the term cat dog would return all rows with the term cat
and/or the term dog in them, in any order.
• Single and double quotes ' and " Enclose a term containing spaces in quotes to search for
exactly that term. E.g. "cat dog" would return rows containing the single term cat dog.
• Backslash Use this character to escape special characters. For example, to search for the
term "cat" including the quotation marks, enter \"cat\".
• Minus - Place a minus symbol before a term to exclude rows containing that term. E.g.
-cat -dog would exclude all rows containing either cat or dog.
• Colon : Specify the name of a column to be searched for the term. E.g. Animal:cat
would search for the term cat only in a column called Animal. For this sort of filtering,
please also refer to the advanced filtering information, below.
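The rules for spaces, quotes and the minus symbol can be mimicked outside the Workbench. The R sketch below is an approximation for illustration only, not the Workbench's implementation. The function names and the example table are hypothetical, matching is assumed to be case-insensitive, and the backslash and colon rules are not handled.

```r
# Split a filter string into terms, keeping quoted phrases together and
# noting which terms are preceded by a minus symbol.
tokenize_filter <- function(query) {
  pattern <- "-?\"[^\"]*\"|-?'[^']*'|[^[:space:]]+"
  tokens <- regmatches(query, gregexpr(pattern, query))[[1]]
  exclude <- startsWith(tokens, "-")
  terms <- sub("^-", "", tokens)
  terms <- gsub("^[\"']|[\"']$", "", terms)  # strip surrounding quotes
  data.frame(term = terms, exclude = exclude, stringsAsFactors = FALSE)
}

# Keep rows matching at least one plain term and none of the excluded terms.
simple_filter <- function(df, query) {
  tokens <- tokenize_filter(query)
  # One lower-cased text string per row, built from all columns.
  row_text <- tolower(do.call(paste, c(df, sep = "\t")))
  has_term <- function(term) grepl(tolower(term), row_text, fixed = TRUE)
  include <- tokens$term[!tokens$exclude]
  exclude <- tokens$term[tokens$exclude]
  keep <- rep(TRUE, nrow(df))
  if (length(include) > 0) keep <- Reduce(`|`, lapply(include, has_term))
  if (length(exclude) > 0) keep <- keep & !Reduce(`|`, lapply(exclude, has_term))
  df[keep, , drop = FALSE]
}

animals <- data.frame(Animal = c("cat", "dog", "cat dog", "bird"),
                      Note = c("a", "b", "c", "d"))
any_term  <- simple_filter(animals, "cat dog")      # rows with cat and/or dog
phrase    <- simple_filter(animals, "\"cat dog\"")  # rows with the exact phrase
excluding <- simple_filter(animals, "-cat -dog")    # rows with neither term
```

Writing the filter as two steps, tokenizing first and matching afterwards, keeps the quoting rules separate from the row matching, which makes each rule easy to check on its own.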
Figure 9.2: Filtering for rows that contain the term "neg" using the Filter button
Advanced filtering
Functionality to define sets of filter criteria is revealed by clicking on the downwards-pointing
arrow at the top right of the table view, (figure 9.3).
Figure 9.3: When the Advanced filter icon is clicked on (top), Advanced filtering fields are revealed
(bottom)
Each filter criterion consists of a column name, an operator and a value. Examples are described
below.
Filter criteria can be added by:
• Right-clicking on a value in the table and selecting the Table filters option from the menu
that appears. Predefined criteria for that column and value combination will be listed
(figure 9.4). Selecting one of these adds it to the list of filters at the top of the table.
Figure 9.4: Right-click on a cell value and choose Table filters to reveal predefined criteria that can
be added to the list of filters for this table.
The Match all and Match any options allow you to specify, respectively, whether all criteria must
be met for a row to be shown, or whether matching a single criterion is enough for a row to be
shown (figure 9.5).
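The difference between the two options corresponds to combining criteria with a logical AND or a logical OR. The R sketch below illustrates this on a small made-up table. The column names, values and criteria are hypothetical.

```r
# A small made-up table.
tbl <- data.frame(
  Name  = c("neg_ctrl_1", "sample_A", "neg_ctrl_2", "sample_B"),
  Count = c(5, 120, 48, 3)
)

# Two criteria, each evaluated to one TRUE/FALSE value per row.
criteria <- list(
  grepl("neg", tbl$Name, ignore.case = TRUE),  # Name contains "neg"
  tbl$Count > 10                               # Count is greater than 10
)

match_all <- Reduce(`&`, criteria)  # every criterion must hold
match_any <- Reduce(`|`, criteria)  # at least one criterion must hold

print(tbl$Name[match_all])
print(tbl$Name[match_any])
```

Rows passing "match all" are always a subset of those passing "match any", which is why the second setting never shows fewer rows than the first.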
The number of rows with a match to the term is reported in the top left of the table.
Operators available for columns containing text are listed below. Tests for matches are not case
sensitive.
Figure 9.5: The same two criteria are defined, but with "Match all" selected in the top image, and
"Match any" selected in the bottom image. Six rows out of 169 match all the criteria, while 154
rows match one or both criteria.
• contains
• doesn't contain
• = Matches exactly
• = Equal to
• ≠ Not equal to
• =
Number formatting and filter criteria: The number of digits to display after the decimal separator
(fractional digits) can be set in the CLC Main Workbench Preferences. Thus, there may be more
digits in a number stored in a table than are shown in a view of that table. For this reason,
we recommend using operators that do not require exact matches, such as greater-than or
less-than comparisons, when filtering on non-integer values.
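The effect of display rounding on exact matches can be illustrated with a few made-up numbers. In the R sketch below, the values and the choice of two fractional digits are assumptions for the example.

```r
stored <- c(0.123456, 0.5, 0.119)               # values held in the table
shown  <- format(round(stored, 2), nsmall = 2)  # values as displayed
print(shown)

# An exact match on the displayed value finds nothing,
# because no stored value is exactly 0.12.
exact <- stored == 0.12

# A range comparison finds both rows that are displayed as 0.12.
in_range <- stored >= 0.115 & stored < 0.125
```

Two rows are displayed as 0.12, but neither is stored as exactly 0.12, so only the range comparison selects them.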
Figure 9.6: Selecting Save Filters from the menu under the Filter Sets... button (top) opens a dialog
showing the filter criteria and prompting for a name for the filter set (bottom).
Figure 9.7: Saved filter sets are listed at the bottom of the drop-down menu revealed when you
click on the Filter Sets... button.
Figure 9.8: Selecting Manage Filters from the menu under the Filter Sets... button (top) opens
the Manage Filters dialog, where saved filter sets can be applied to the open table, or deleted.
Functionality to export and import filter sets is also provided here (bottom).
Chapter 10
Data download
Contents
10.1 Search for Sequences at NCBI
10.1.1 NCBI search options
10.1.2 Handling of NCBI search results
10.2 Search for PDB Structures at NCBI
10.2.1 Structure search options
10.2.2 Handling of NCBI structure search results
10.2.3 Save structure search parameters
10.3 Search for Sequences in UniProt (Swiss-Prot/TrEMBL)
10.3.1 UniProt search options
10.3.2 Handling of UniProt search results
10.4 Sequence web info
CLC Main Workbench offers different ways of searching and downloading online data. You must
be online when initiating and performing the following searches.
• Click on the tab of the search view and drag and drop it into a folder in the Navigation Area.
CHAPTER 10. DATA DOWNLOAD 150
These actions save the search query. (It does not save the search results.)
This can be useful when you run the same searches periodically.
• All Fields Searches for the terms provided in all fields of the NCBI database.
• Organism
• Definition/Title
• Modified Search for entries modified within the period specified from a drop-down list.
• Gene Location Choose from Genomic DNA/RNA, Mitochondrion, or Chloroplast.
• Molecule Choose from Genomic DNA/RNA, mRNA or rRNA.
• Sequence Length Enter a number for a maximum or minimum length of the sequence.
• Gene Name
• Accession
Check the "Append wildcard (*) to search words" checkbox to indicate that the term entered
should be interpreted as the first part of the term only. E.g. searching for "genom" with that box
checked would find entries starting with that term, such as "genomic" and "genome".
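The wildcard behaves like prefix matching on words. The R sketch below shows that idea on a made-up word list. It illustrates the matching rule only and does not reflect how the NCBI evaluates queries.

```r
words <- c("genomic", "genome", "gene", "pangenome")

# With the wildcard: words that start with the search term.
with_wildcard <- words[startsWith(words, "genom")]

# Without the wildcard: only words identical to the search term.
without_wildcard <- words[words == "genom"]

print(with_wildcard)
```

Here "gene" is not matched because it does not start with the full term, and "pangenome" is not matched because the term appears in the middle of the word rather than at its start.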
When you are satisfied with the parameters you have entered, click on the Start search button.
• Accession The accession for that entry. Click on the link to open that entry's page at the
NCBI in a web browser.
• Modification date The date the entry was last updated in the database searched.
The columns to display can be configured in the "Show column" tab of the Side Panel settings
on the right-hand side.
Select one or more rows of the table and use buttons at the bottom of the view to:
• Download and Open Sequences are opened in a new view after download is complete.
You can also download and open sequences by dragging selected rows to a new tab area
or by double-clicking on a row.
• Download and Save Sequences are downloaded and saved to a location you specify.
You can also download and save sequences by selecting rows and copying them (e.g. using
Ctrl + C), and then selecting a folder in the Navigation Area and pasting (e.g. using Ctrl +
V).
• Open at NCBI The sequence entry page(s) at the NCBI are opened in a web browser.
The functions offered by these buttons are also available in the menu that appears if you
right-click over selected rows.
Note: The modification dates of downloaded sequences can be more recent than those reported
in the results table. This depends on the database versions made available for searching at the
NCBI.
Downloading and saving sequences can take some time. This process runs in the background,
so you can continue working on other tasks. The download process can be seen in the Status
bar and it can be stopped, if desired, as described in section 2.4.
By default, CLC Main Workbench offers one text field where the search parameters can be
entered. Click Add search parameters to add more parameters to your search.
Note! The search is an "AND" search, meaning that when adding search parameters to your
search, you search for both (or all) text strings rather than "any" of the text strings.
You can append a wildcard character by clicking the checkbox at the bottom. This means that
you only have to enter the first part of the search text, e.g. searching for "prot" will find both
"protein" and "protease".
The following parameters can be added to the search:
• All fields. Text, searches in all parameters in the NCBI structure database at the same
time.
• Organism. Text.
• Author. Text.
The search parameters listed are those most recently used. All fields allows searches in all
parameters in the database at the same time.
All fields also provides an opportunity to restrict a search to parameters that are not
listed in the dialog. E.g. writing 'gene[Feature key] AND mouse' in All fields generates
hits in the GenBank database that contain one or more genes and where 'mouse' appears
somewhere in the GenBank file. NB: the 'Feature Key' option is only available in GenBank
when searching for nucleotide structures. For more information about how to use this
syntax, see
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
When you are satisfied with the parameters you have entered, click Start search.
Note! When conducting a search, no files are downloaded. Instead, the program produces a list
of links to the files in the NCBI database. This ensures a much faster search.
• Accession.
• Description.
• Resolution.
• Method.
• Protein chains
• Release date.
It is possible to exclude one or more of these columns by adjusting the View preferences for the
database search view. Furthermore, your changes in the View preferences can be saved. See
section 4.6.
Several structures can be selected, and by clicking the buttons in the bottom of the search view,
you can do the following:
• Download and save. Lets you choose a location for saving the structure.
• Open at NCBI. Open additional information on the selected structure at NCBI's web page.
Double-clicking a hit will download and open the structure. The hits can also be copied into the
View Area or the Navigation Area from the search results by drag and drop, copy/paste or by
using the right-click menu as described below.
Figure 10.3: By right-clicking a search result, it is possible to choose how to handle the relevant
structure.
The selected structures are not downloaded from the NCBI website but from the
RCSB Protein Data Bank http://www.rcsb.org/pdb/home/home.do in PDB format.
• Click on the tab of the search view and drag and drop it into a folder in the Navigation Area.
These actions save the search query. (It does not save the search results.)
This can be useful when you run the same searches periodically.
Figure 10.4: Search in UniProtKB by entering search terms and clicking on the "Start search"
button. A table containing information about entries matching the query terms is returned.
Select one of the 2 subsections of UniProtKB to search in, or select both to search all of
UniProtKB.
• Swiss-Prot Searches among manually curated entries. These are the entries marked as
"reviewed" in UniProtKB.
• TrEMBL Searches among computationally analyzed entries that have been annotated using
automated systems. These are the entries marked "unreviewed" in UniProtKB.
Search fields
A single search field is presented by default. Click on "Add search parameters" to add more.
The following options are available:
• All fields Search for the term provided in all fields available at the UniProtKB website
https://uniprot.org/.
• Created Search for entries created within the period specified from a drop-down list.
• Modified Search for entries modified within the period specified from a drop-down list.
• Protein existence. Search for entries with the evidence level specified from a drop-down
list.
When the Append wildcard (*) to search words option is checked, the search is broadened to
include entries containing terms starting with the text you provided.
Click on the Start search button to run the search.
Information about entries meeting all the conditions specified is returned in a table. No data is
downloaded at this point. Working with these results, including downloading entries, is described
in section 10.3.2.
• Click on the tab of the search view and drag and drop it into a folder in the Navigation Area.
These actions save the search query. (It does not save the search results.)
This can be useful when you run the same searches periodically.
• Hit The position of the entry in the results. E.g. 1 for the first entry in the list returned, 2
for the second, and so on.
• Accession The accession of the entry. Clicking on the link opens the entry's page at the
UniprotKB website.
• ID The ID of the entry. Clicking on the link opens the entry's page at the UniprotKB website.
• Protein Existence The level of evidence supporting the existence of the protein.
• Pubmed Entries The list of Pubmed IDs mapped to the entry. Clicking on the link opens a
page listing these Pubmed entries.
• Reviewed Either "reviewed" for entries in Swiss-Prot, or "unreviewed" for entries in TrEMBL.
The columns displayed can be customized using the Side Panel settings. See section 4.6 for
details.
If you wish to open webpages for several entries at once, highlight the rows of interest and click
on the Open at UniProt button.
• Click on the Download and Save button. You will be prompted for a location to save the
entries to.
• Right-click over a selected area and choose the option Download and Save from the menu
presented.
• Copy (Ctrl-C) to copy the entry information. Click on a folder in the Navigation Area and then
paste (Ctrl-V).
The selected entries are downloaded from UniprotKB. Multiple entries selected at the same time
are saved to a single protein sequence list.
To download and open entries directly in the viewing area, select the rows of interest and then
do one of the following:
• Right-click over a selected area and choose the option Download and Open from the menu
presented.
• Drag the row(s) until the mouse cursor is next to an existing tab in the view area. When the
mouse button is released, a new tab is opened, and the selected entries are downloaded
and opened in that tab.
references at NCBI. This is useful for quickly obtaining updated and additional information about
a sequence.
The functionality of these search functions depends on the information that the sequence
contains. You can see this information by viewing the sequence as text (see section 14.5). In
the following sections, we will explain this in further detail.
The procedure for searching is identical for all four search options (see also figure 10.5):
Open a sequence or a sequence list | Right-click the name of the sequence | Web
Info | select the desired search function
This will open your computer's default browser searching for the sequence that you selected.
Google sequence The Google search function uses the accession number of the sequence
as the search term on https://www.google.com. The resulting web page is equivalent to
typing the accession number of the sequence into the search field on https://www.google.com.
PubMed References The PubMed references search option lets you look up PubMed articles
based on references contained in the sequence file (when you view the sequence as text it
contains a number of "PUBMED" lines). Not all sequences have PubMed references. If a sequence
has none, you will see a dialog and the browser will not open.
UniProt The UniProt search function searches in the UniProt database (https://uniprot.org/)
using the accession number. Furthermore, it checks whether the sequence was indeed
downloaded from UniProt.
Additional annotation information When sequences are downloaded from GenBank they often
link to additional information on taxonomy, conserved domains etc. If such information is
available for a sequence it is possible to access additional accurate online information. If the
db_xref identifier line is found as part of the annotation information in the downloaded GenBank
file, it is possible to easily look up additional information on the NCBI web-site.
To access this feature, right-click an annotation and see which databases are available.
For tracks, these links are also available in the track table.
Chapter 11
Contents
11.1 Running tools
11.1.1 Running a tool on a CLC Server
11.2 Handling results
11.3 Batch processing
This section describes how to run tools, and how to handle and inspect results. We cover
launching tools for individual runs, as well as launching them in batch mode, where the tool is
run multiple times in a hands-off manner, using different input data for each run.
Launching workflows, individually or in batch mode, as well as running sections of workflows in
batch mode, are covered in chapter 13.
• Double-click on its name in a tab of the Toolbox panel at the bottom, left side of the
Workbench.
• Select it from the Tools or Workflows menu at the top of the Workbench.
CHAPTER 11. RUNNING TOOLS, HANDLING RESULTS AND BATCHING 162
• Select the element(s) to analyze and drag them from the Navigation Area onto the name of
a tool or workflow in the Toolbox panel.
• Click on the Quick Launch ( ) option at the top of the Tools menu or Workflows menu.
• Click on the ( ) icon beside the search field at the top of the Tools tab or Workflows tab
in the Toolbox panel at the bottom, left of the Workbench.
Figure 11.1: Tools, installed workflows and template workflows can be quickly found and launched
using the Quick Launch tool.
Double-click on a row to launch a tool or workflow from the Quick Launch dialog. Alternatively,
select a row and click on the Open button.
When terms are entered into the text field at the top of the Quick Launch dialog, only tools
and workflows with matches to those terms in their name, description or path will be listed
(figure 11.2). Surround terms with single or double quotes to search for specific terms with
spaces in them, for example "sequence list".
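The search behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Workbench implementation: the function name `matches_query`, the example tool entries, and the assumptions that matching is case-insensitive and that every term must match are all hypothetical.

```python
import shlex

def matches_query(query, name, description="", path=""):
    """Return True if every search term occurs in the name, description or path."""
    # shlex.split keeps text inside single or double quotes together as one term,
    # so '"sequence list"' becomes the single term 'sequence list'.
    terms = [term.lower() for term in shlex.split(query)]
    fields = [name.lower(), description.lower(), path.lower()]
    # Assumption: all terms must match, and matching ignores case.
    return all(any(term in field for field in fields) for term in terms)

# Hypothetical entries: (name, description, path)
tools = [
    ("Create Sequence List", "Combine sequences into a sequence list", "Utility Tools"),
    ("Create Alignment", "Align a list of sequences", "Alignments and Trees"),
]

quoted_hits = [name for name, desc, path in tools
               if matches_query('"sequence list"', name, desc, path)]
unquoted_hits = [name for name, desc, path in tools
                 if matches_query("sequence list", name, desc, path)]
```

In this sketch the quoted search returns only the first entry, because the phrase must appear as a whole, while the unquoted search returns both entries, because the words "sequence" and "list" are matched separately.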
The Path column contains the location of tools and workflows relative to the Tools or Workflows
menu, respectively. Functionality available under other menus includes the relevant menu name
in the path.
Click on the Favorites tab to see the subset of tools that are frequently used or have been
selected as favorites (see section 2.3).
For tools where names have changed between Workbench versions, searches using terms in the
older name will still find the relevant tool.
Figure 11.2: Typing a term in the search field limits the list of tools and workflows to those with
that term in their name, description or path.
You can move forward and back through the wizard steps by clicking the buttons Next and
Previous, respectively, which are present at the bottom of the wizard. Clicking on the Help
button in the bottom left corner of the launch wizard opens the documentation for the tool being
launched.
The rest of this section covers the general launch wizard steps in more detail.
Specify the execution environment
If more than one execution environment is available, and a default selection has not already
been set, the first wizard step will offer a list of the available environments.
For example, if you are logged into a CLC Server, or if you have the CLC Cloud Module installed
and an AWS Connection has been configured with credentials giving access to a CLC Genomics
Cloud setup, you are offered the option of running the job in different execution environments
(figure 11.3).
Information about launching jobs on a CLC Server is provided in section 11.1.1.
Select the input data for analysis tools
Figure 11.3: This Workbench has the CLC Cloud Module installed and has an active AWS Connection
to a CLC Genomics Cloud setup. Thus, this job could be run on the Workbench, or run on AWS by
selecting the option CLC Genomics Cloud.
When selecting data to use as input to a tool, a view of the Navigation Area is presented, listing
the elements that could be selected as input, as well as folders (figure 11.4). The data types
that can be used as input for a given tool are described in the manual section about that tool.
Figure 11.4: You can select input files for the tool from the Navigation Area view presented on the
left hand side of the wizard window.
Selected elements will be listed in the right hand pane. To select the inputs, you can:
• Double click on them in the Navigation Area view in the launch wizard, or
• Select them with a single click in the Navigation Area view in the launch wizard and then
click on the right hand arrow.
• Before opening the launch wizard, pre-select data elements in the main Navigation Area of
the Workbench. When the tool is launched, these elements will automatically be placed in
the "Selected elements" list.
To remove entries from the "Selected elements" list, double-click on them or select them with a
single click and then click on the left hand arrow.
When multiple elements are selected, most analysis tools will analyze them together, as a single
input, unless the "Batch" option at the bottom is checked. With the "Batch" option checked,
the tool is run multiple times, once for each "batch unit", which may be a data element, or a
folder containing data elements or containing folders of elements. Batch processing is described
in section 11.3.
Select the input data for import tools
Selecting files for import is described in chapter 7. It is generally similar to selecting input for
analysis tools, but involves selecting files from a file system or remote location. Many import
selection wizards also support drag-and-drop for selecting files to import.
Configure the available options for the tool
Depending on the tool, there may be one or more wizard steps containing options affecting how
the tool behaves (figure 11.5).
Clicking on the Reset button resets the values for the options in that wizard step to their default
values.
• Workbench. Run the analysis on the computer the CLC Workbench is running on.
• Server. Run the analysis using the CLC Server. For job node setups, analyses will be run on
the job nodes.
• Grid. Only offered if the CLC Server setup has grid nodes. Here, jobs are sent from the
master CLC Server to be run on grid nodes. The grid queue to submit to can be selected
from the drop down list under the Grid option.
You can check the Remember setting and skip this step option if you wish to always use the
selected option when submitting analyses. If you select this option but later change your mind,
just start up an analysis and click on the Previous button to open these options again.
Figure 11.6: When logged into the CLC Server, you can select where a job should be run.
Most wizard steps for launching a job on a CLC Workbench or on a CLC Server are the same.
There are two minor differences when launching jobs to run on a CLC Server: results are always
saved, and a log of the job is always created and saved alongside the results.
Data access: When you run a job on a CLC Server, you will generally only be able to select data
from and save results to areas known to the CLC Server. With default server settings, you will not
be able to upload data from your local system. Your server administrator can enable this if they
wish. See
https://resources.qiagenbioinformatics.com/manuals/clcserver/current/admin/index.php?manual=Direct_data_transfer_from_client_systems.html.
Disconnecting from the CLC Server: Once the job has been submitted, you can disconnect
from the CLC Server if you wish, or close the CLC Workbench entirely. Exception: If you are
importing data from the local file system, you must wait until the data has been imported before
disconnecting. A notification about server jobs that finished is presented the next time you log in
to the CLC Server. See section 2.4.
• Open. This will open the result of the analysis in a view. This is the default setting.
• Save. The results will be saved rather than opened. You will be prompted for where you
wish the results to be saved (figure 11.7). You can save to an existing area or create a new
folder to save the results into.
You may also have an option called "Open log". If checked, a window will open in the View Area
after the analysis has started and the progress of the job will be reported there line by line.
Click Finish to start the analysis.
If you chose the option to open the results, they will open automatically in one or several tabs in
the View Area. The data will not have been saved at this point. The name of each tab is in bold,
appended with an asterisk to indicate this. There are several ways to save the results you wish
to keep:
• Select the tab and then use the key combination Ctrl + S (or ⌘ + S on macOS).
• Right click on the tab and choose "Save" from the context menu.
• Go to the File menu and select the option "Save" or "Save As...".
If you chose to save the results, they will have been saved in the location specified. You can
open the results in the Navigation Area directly after the analysis is finished. A quick way to find
the results is to click on the little arrow to the right of the analysis name in the Processes tab
and choose the option "Show results" or "Find Results", as shown in figure 11.8.
Figure 11.8: Find or open the analysis results by clicking on the little arrow to the right of the
analysis name in the Processes tab and choosing the relevant item from the menu.
Batch mode
Batch mode is activated by clicking the Batch checkbox in the dialog where the input data is
selected (figure 11.9).
Figure 11.9: When launching an analysis in Batch mode, individual elements and/or folders can be
selected. Here, a single folder that contains both elements and subfolders of elements has been
selected.
In Batch mode, the analysis is run once per batch unit. A batch unit consists of the data elements
to be analyzed together. A batch unit can be a single data element, or can consist of multiple
data elements.
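Batch units are made up of:
• The individual data elements and folders selected, or those directly under a selected folder
(figure 11.9).
• Data associated with each row of a CLC Metadata Table. Data elements associated with a row, of
a type compatible as input to the analysis, are the default contents of a batch unit. See
figure 11.10 and figure 11.11.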
Figure 11.10: When the Batch box is checked, a CLC Metadata Table can be selected as input.
Figure 11.11: Data associated with each row in a CLC Metadata Table, of a type compatible with
that analysis, make up the default content of batch units.
Batch overview
In the batch overview step, the elements in each batch unit can be reviewed, and refined based
on their names using the fields Only use elements containing and Exclude elements containing.
In figure 11.12, the batch units, i.e. those elements and folders directly under the folder selected
in figure 11.9, are shown. In each batch unit, data elements that could be used in the analysis
are listed on the right hand side. Some batch units contain more than one data element. Those
data elements would be analyzed together. To limit the analysis to just sequence lists containing
trimmed sequences, the term "trim" has been entered into a filter field near the bottom.
Folders that do not contain any elements compatible with the analysis are not shown in the batch
overview.
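The effect of the two filter fields can be sketched as follows. This is an illustrative model only: the function name `refine_batch_units`, the example batch units, and the assumptions that filtering is a plain substring test on element names and that a batch unit left with no elements is dropped are hypothetical, not taken from the Workbench itself.

```python
def refine_batch_units(batch_units, only_containing="", exclude_containing=""):
    """Filter the element names in each batch unit and drop units left empty."""
    refined = {}
    for unit, elements in batch_units.items():
        kept = [
            name for name in elements
            # Keep names containing the "only use" term, if one was entered ...
            if (not only_containing or only_containing in name)
            # ... and discard names containing the "exclude" term, if one was entered.
            and not (exclude_containing and exclude_containing in name)
        ]
        if kept:  # assumption: units without any remaining elements are not shown
            refined[unit] = kept
    return refined

# Hypothetical batch units: unit name -> names of compatible data elements
units = {
    "Sample A": ["A reads", "A reads trimmed"],
    "Sample B": ["B reads trimmed"],
    "Notes": [],
}

trimmed_only = refine_batch_units(units, only_containing="trim")
```

Here `trimmed_only` holds one trimmed element for each of "Sample A" and "Sample B", mirroring the use of the term "trim" in the example above.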
Figure 11.12: Overview of the batch units (left) and the input elements defined by each batch unit
(right). By default, all elements that can be used as inputs are listed on the right (top). By entering
terms in the filter fields, the list of elements in the batch units can be refined. Here, only sequence
lists including trimmed sequences will be included (bottom).
Figure 11.13: Options for saving results when an analysis is run in Batch mode.
• Save in input folder Save all outputs into the same folder as the input data. For batch units
defined by folders, the results of each analysis are saved into the folder with the input
data. If the batch units were individual data elements, results are put into the same folder
as the input elements.
• Save in specified location You will be prompted in the next step to select a folder where
the outputs should be saved to. The Create subfolders per batch unit checkbox allows you
to specify whether subfolders should be created to store the results from each batch unit:
  - When checked, results for each batch unit are written to a newly created subfolder
  under the folder you select in the next step. A subfolder is created for each batch unit.
  (This is the default option.)
  - When unchecked, results from all batch units are written to the folder you select in
  the next step.
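As a rough sketch of how these options determine where results end up, consider the function below. The function name `result_folder`, its parameters and the example paths are hypothetical, and naming each subfolder after its batch unit is an assumption made for illustration.

```python
from pathlib import PurePosixPath

def result_folder(unit_name, input_folder, save_in_input_folder=True,
                  specified_location=None, create_subfolders=True):
    """Return the folder that the results for one batch unit would be written to."""
    if save_in_input_folder:
        # "Save in input folder": results go alongside the input data.
        return PurePosixPath(input_folder)
    if specified_location is None:
        raise ValueError("a location is required when not saving in the input folder")
    target = PurePosixPath(specified_location)
    # "Create subfolders per batch unit" checked: one subfolder per batch unit
    # (assumed here to be named after the unit); unchecked: everything in one folder.
    return target / unit_name if create_subfolders else target

per_unit = result_folder("Sample A", "/data/Sample A",
                         save_in_input_folder=False, specified_location="/results")
shared = result_folder("Sample A", "/data/Sample A", save_in_input_folder=False,
                       specified_location="/results", create_subfolders=False)
```

With these inputs, `per_unit` is `/results/Sample A` and `shared` is `/results`.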
Metadata
Contents
12.1 Creating metadata tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
12.1.1 Importing metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
12.1.2 Creating a metadata table directly in the Workbench . . . . . . . . . . . 176
12.2 Associating data elements with metadata . . . . . . . . . . . . . . . . . . . 180
12.2.1 Associate Data Automatically . . . . . . . . . . . . . . . . . . . . . . . . 181
12.2.2 Associate Data with Row . . . . . . . . . . . . . . . . . . . . . . . . . . 183
12.3 Working with data and metadata . . . . . . . . . . . . . . . . . . . . . . . . 184
12.3.1 Finding data elements based on metadata . . . . . . . . . . . . . . . . . 184
12.3.2 Viewing metadata associations . . . . . . . . . . . . . . . . . . . . . . . 185
12.3.3 Removing metadata associations . . . . . . . . . . . . . . . . . . . . . . 186
12.3.4 Identifying metadata rows without associated data . . . . . . . . . . . . 187
12.3.5 Editing Metadata tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
12.4 Moving, copying and exporting metadata . . . . . . . . . . . . . . . . . . . . 191
Metadata refers to information about data. In the context of the CLC Main Workbench, this usually
means information about samples. For example, a set of reads could come from a particular
specimen at a particular time point with particular characteristics. The specimen, time and
characteristics would be metadata for that set of reads.
Examples in this chapter refer to tools present in the CLC Genomics Workbench, but the principles
apply to other CLC Workbenches.
What is metadata used for? Core uses of metadata in CLC software include:
• Defining batch units when launching workflows in batch mode, described in section 13.3.2.
• Distributing data to the relevant input channels in a workflow when using Collect and
Distribute elements, described in section 13.2.5.
• Finding and selecting data elements based on sample information (in a CLC Metadata
Table). Workflow Result Metadata tables are of particular use when reviewing results
generated by workflows run in batch mode and are described in section 13.3.1.
CHAPTER 12. METADATA 173
• Running tools where characteristics of the data elements are relevant. An example is
Differential Expression for RNA-Seq in the CLC Genomics Workbench, described at
https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Differential_Expression_RNA_Seq.html.
Metadata tables
An example of a CLC Metadata Table in the CLC Main Workbench is shown in figure 12.1. Each
column represents a property of a sample (e.g., identifier, height, age, treatment) and each row
contains information relevant to a sample. A single column can be designated the key column.
That column must contain unique entries.
Figure 12.1: A simple metadata table, with the key column highlighted in blue.
Each row can have associations with one or more data elements, such as sequence lists,
expression tracks, variant tracks, etc. Associating data elements with relevant metadata rows,
automatically or manually, is covered in section 12.2.
Information from an Excel, CSV or TSV format file can be imported into a CLC Metadata Table, as
described in section 12.1.1. CLC Metadata Tables are also generated by workflows, as described
in section 13.3.1.
Figure 12.2: A CLC Metadata Table and corresponding Metadata Elements table showing elements
associated with sample 27T.
• Create a new CLC Metadata Table containing a subset of the rows in another CLC Metadata
Table.
To do this, open an existing CLC Metadata Table, select the rows of interest and click on
the Create New Metadata Table... ( ) button at the bottom of the editor. This option is
also available in the menu that opens when you right-click on the selection (figure 12.3).
Data elements with associations to the selected rows also acquire an association with the new
CLC Metadata Table.
Workflow Result Metadata tables, created when a workflow is run, are also CLC Metadata Tables.
These are described in section 13.3.1.
Figure 12.3: Selected rows in a CLC Metadata Table can be put into a new CLC Metadata Table
using the option "Create New Metadata Table...".
The first column in the selected file must have unique entries. That column will be designated
as the key column. A different column can be specified as the key column later. For the optional
step for association of data elements to work, the first column must contain entries that can be
matched with the relevant data element names.
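Before importing, it can be handy to confirm that the first column of a delimited text file really does contain unique entries. The sketch below is one way to do that outside the Workbench, using only the Python standard library. The function name `duplicate_keys` and the example content are hypothetical, and the first row is assumed to hold the column names.

```python
import csv
import io
from collections import Counter

def duplicate_keys(csv_text, delimiter=","):
    """Return the first-column entries that occur more than once, sorted."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    data = rows[1:]  # assumption: the first row holds the column names
    counts = Counter(row[0] for row in data if row)
    return sorted(key for key, count in counts.items() if count > 1)

# Hypothetical file content with a repeated sample ID
example = "Sample ID,Treatment\nETC-001,control\nETC-002,treated\nETC-001,treated\n"
repeated = duplicate_keys(example)
```

An empty list means the first column is suitable as a key column. In this example `repeated` contains `ETC-001`, so that entry would need to be resolved before import. For a TSV file, pass `delimiter="\t"`.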
Select the Excel, CSV or TSV format file with metadata to be imported. The rows in that file are
displayed in the Metadata preview window (figure 12.4).
The format of the columns should reflect the column contents. The designated format can be
changed from within the Metadata Table editor, as described in section 12.3.5. There, you
can change the column data types (e.g. to types of numbers, dates, true/false) and you can
designate a new key column.
Figure 12.4: Rows being imported from a file containing metadata are shown in the Metadata
preview table.
Associating metadata with data (optional) The "Associate with data" wizard step (figure 12.5)
is optional. To proceed without associating data to metadata, click on the Next button. Associating
data with metadata can be done later, as described in section 12.2.
To associate data with the metadata:
• Click on the file browser button to the right of the Location of data field
• Select the matching scheme to use: Exact, Prefix or Suffix. These options are described in
section 12.2.1.
Figure 12.5: Three data elements are selected for association. The "Prefix" partial matching
scheme is selected for matching data element names with the appropriate metadata row, based
on the information in the Sample ID column in this case.
The Data association preview area shows data elements that will have associations created,
along with information from the metadata row they are being linked with. This gives the opportunity
to check that the matching is leading to the expected links between data and metadata.
You can then select where you wish the metadata table to be saved and click on Finish.
The associated information can be viewed for a given data element in the Show Element Info
view (figure 12.6).
Figure 12.6: Metadata associations can be seen, edited, refreshed or deleted via the Show Element
Info view.
Defining the table structure Click Setup Table at the bottom of the view (figure 12.7).
To create a metadata table from scratch, use the "Add column right" or "Add column left" buttons
( ) to define the table structure with the number of columns you will need, and edit the fields
of each column as needed.
To import the table from a file, click on Setup Structure from File. In the dialog that appears
(figure 12.8), you need to provide the following information:
• Filename The EXCEL or delimited TEXT file to import. Column names should be in the first
row of this file.
• Encoding For text files only: the encoding used to create the file. The default is UTF-8.
• Separator For text files only: The character used to separate the columns. The default is
semicolon (;).
For each column in the external file, a column will be created in the new metadata table. By
default the type of these imported columns is "Text". You will see a reminder to set the column
type for each column and to designate one of the columns as the key column.
Populating the table Click on the Manage Data button at the bottom of the view (figure 12.9).
Figure 12.9: Tool for managing the metadata itself. Notice the button labeled Import Rows from
File.
The metadata table can then be populated by editing each column manually. Row information is
added manually by clicking on the ( ) button and typing in the information for each column.
It is also possible to import information from an external file. In that case, the column names in
the metadata table in the workbench will be matched with those in the external file to determine
which values go into which cell. Only cell values in columns with an exact name match will
be imported. If the file used contains columns not in the metadata table, the values in those
columns will be ignored. Conversely, if the metadata table contains columns not present in the
file, imported rows will have no values for those columns.
Click on Import Rows from File and select the external file of metadata. This brings up the
window shown in figure 12.10.
When working with an existing metadata table and adding extra rows, it is generally recommended
that a key column be designated first. If a key column is not present, then all rows in the file
will be imported. With no key column designated, if any rows from that file were imported into
the same metadata table earlier, a duplicate row will be created. With a key column, rows with
a new, unique entry for that column are added to the table and existing rows with a key entry in
the file will be updated, incorporating any changes present in the file. Duplicate rows will not be
created.
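The import rules described above (exact column name matching, and the difference a key column makes) can be modelled with a short Python sketch. This is not the Workbench implementation: the function name `import_rows`, the dictionary-based row representation and the example data are hypothetical. Leaving existing values untouched for columns that are absent from the file is also an assumption.

```python
def import_rows(table_rows, table_columns, file_rows, key_column=None):
    """Return the table rows after importing file_rows, without modifying the inputs."""
    def project(row):
        # Only columns with an exact name match are imported. Table columns
        # missing from the file get no value (None); extra file columns are ignored.
        return {column: row.get(column) for column in table_columns}

    result = [dict(row) for row in table_rows]
    if key_column is None:
        # No key column: every row in the file is added, so re-importing duplicates rows.
        result.extend(project(row) for row in file_rows)
        return result

    position = {row[key_column]: i for i, row in enumerate(result)}
    for row in file_rows:
        new_row = project(row)
        key = new_row[key_column]
        if key in position:
            # Existing key: update the row with the values present in the file.
            changes = {c: v for c, v in new_row.items() if c in row}
            result[position[key]].update(changes)
        else:
            # New, unique key: add the row to the table.
            position[key] = len(result)
            result.append(new_row)
    return result

columns = ["ID", "Age", "Treatment"]
table = [{"ID": "S1", "Age": 30, "Treatment": "control"}]
from_file = [{"ID": "S1", "Age": 31, "Batch": "x"}, {"ID": "S2", "Age": 25, "Batch": "y"}]

with_key = import_rows(table, columns, from_file, key_column="ID")
without_key = import_rows(table, columns, from_file)
```

With the key column set, `with_key` has two rows: S1 is updated in place and S2 is added. Without a key column, `without_key` has three rows, including a second S1 row.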
The options presented in the Import Metadata Rows into Metadata Table window are:
• File. The file containing the metadata to import. This can be Excel (.xlsx/.xls) format or a
delimited text file.
• Encoding. For text files only: the text encoding of the selected file. Specifying the correct
encoding is important to ensure that the file is correctly interpreted.
• Separator. For text files only: the character used to separate columns in the file.
• Locale. For text files only: the locale used to format numbers and dates within the file.
• Date format. For text files only: the date format used in the imported file.
• Date-time format. For text files only: the date-time format used in the imported file.
The date and date-time templates use the Java patterns for date and time formatting.
Meaning of some of the symbols:
With a short year format (YY), 2000 will be added when imported as, or converted to, Date
or Date and time format. Thus, when working with dates before the year 2000 or after
2099, please use a four digit format for the year (YYYY).
Click the Finish button when the necessary fields have been filled in.
The progress and status of the row import can be seen in the Processes tab in the Toolbox area,
at the bottom, left side of the Workbench. Any errors resulting from an import that failed can be
reviewed here. The most frequent errors are associated with selecting the wrong separator or
encoding, or wrong date/time formats when importing rows from delimited text files.
Once the rows are imported, the metadata table can be saved.
• By default, when input data for an analysis is associated with metadata, the results will
inherit any unambiguous association. Appropriate role labels are assigned by the analysis
tool. For example, a read mapping tool will assign the role "Unmapped reads" to a sequence
list of unmapped reads that it produces.
• By default, outputs from a workflow are associated with the relevant metadata rows in
workflow results metadata tables. In these tables, the role assigned is always "Result
data".
• Manually triggering data associations, either through matching the metadata key column
entries with data element names, or by specifying the data element to associate with a
given row. Here, roles to apply are chosen by you when triggering the associations.
The rest of this section describes this last point, where you associate data elements to metadata.
To do this, open a metadata table, and then click on the Associate Data button at the bottom of
the Metadata Table view. Two options are available:
• Associate Data with Row Manually make associations row by row, by selecting a row of
the metadata and a particular data element in the Navigation Area. Here, information in the
metadata table does not need to match data element names. This option is also available
when right-clicking a row in the table. See section 12.2.2.
In the Association setup step, you specify whether the matching of the data element names to
the entries in the key column should be based on exact or partial matching (described below).
A preview showing how elements are matched to metadata rows using the selected matching
scheme is shown in the wizard (figure 12.12).
You also specify a role for each element. The default role provided is "Sample data". You can
specify any term you wish.
Figure 12.12: Data element names can be matched either exactly or partially to the entries in the
key column. Here, the Prefix matching scheme has been selected. A preview showing how elements
are matched to metadata rows using that scheme is shown in the Data association preview area,
at the bottom.
After the job has run, data associations and roles are saved for all the selected data elements
where the name matches a key column entry according to the selected matching scheme.
Note: Data elements selected that already have associations with the CLC Metadata Table will
have their associations updated to reflect the current information in the CLC Metadata Table.
This means associations will be deleted for a selected data element if there are no rows in the
metadata table that match the name of that data element. This could happen if, for example,
you changed the name of a data element with a metadata association, and did not change the
corresponding key entry in the metadata table.
Matching schemes A data element name must match an entry in the key column of a metadata
table for an association to be set up between that data element and the corresponding row of the
metadata table. Three schemes are available in Associate Data Automatically for matching
up names with key entries:
• Exact - data element names must match a key exactly to be associated. If any aspect of the
key entry differs from the name of a selected data element, no association will be created.
• Prefix - data elements with names partially matching a key will be associated: here the first
whole part(s) of a name must match a key entry in the metadata table for an association
to be established. This option is explained in more detail below.
• Suffix - data elements with names partially matching a key will be associated: here the last
whole part(s) of a name must match a key entry in the metadata table for an association
to be established. This option is explained in more detail below.
Partial matching rules For each data element being considered, the partial matching scheme
involves breaking a data element name into components and searching for the best match from
the key entries in the metadata table. In general terms, the best match means the longest key
that matches entire components of the name.
The following describes the matching process in detail:
• Break the data element name into its component parts based on the presence of delimiters.
It is these parts that are used for matching to the key entries of the metadata table.
Delimiters are any non-alphanumeric characters. That is, anything that is not a letter (a-z
or A-Z) or number (0-9). So, for example, characters like hyphens (-), plus symbols (+),
spaces, brackets, and so on, would be used as delimiters.
If partial matching was chosen, a data element called Sample234-1 (mapped)
(trimmed) would be split into 4 parts: Sample234, -1, (mapped) and (trimmed).
• Matches are made at the component level. A whole key entry must match perfectly to at
least the first (with the Prefix option) or the last (with the Suffix option) complete component
of a data element name.
For example, a key entry Sample234 would be a match to the data element with name
Sample234-1 (mapped) (trimmed) because the whole key entry matches the whole
of the first component of the data element name. Conversely, if the key entry had been
Sample23, no match would be identified, because the whole key entry does not match
at least the whole of the first component of the data element name.
In cases where a data element could be matched to more than one key, the longest key
matched determines the metadata row the data will be associated with.
The table below provides examples to illustrate the partial matching system, on a table
that has keys with sample IDs like those in figure 12.13 (i.e., ETC-001, ETC-002, ...,
ETC-013).
Data Element              Key      Reason for association
ETC-001 (Reads)           ETC-001  Key ETC-001 matches the first part of the name
ETC-001 un-m... (single)  ETC-001  ''
ETC-001 un-m... (paired)  ETC-001  ''
ETC-002                   ETC-002  Key ETC-002 matches the whole name
ETC-003                   None     No keys match this data element name
ETC-005                   ETC-005  Key ETC-005 matches the whole name
ETC-005-1                 ETC-005  Key ETC-005 matches the first part of the name
ETC-006-5                 ETC-006  Key ETC-006 matches the first part of the name
ETC-007                   None     No keys match this data element name
ETC-007 (mapped)          None     ''
ETC-008                   None     ''
ETC-008 (report)          None     ''
ETC-009                   ETC-009  Key ETC-009 matches the whole name
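The partial matching rules can be approximated with the Python sketch below. It is an interpretation of the rules described in this section, not the Workbench's own code: the function names `components` and `best_key` are hypothetical, both names and keys are reduced to their alphanumeric components (so delimiters themselves are not compared), and matching is assumed to be case-sensitive.

```python
import re

def components(text):
    """Split text into alphanumeric components; any other character is a delimiter."""
    return [part for part in re.split(r"[^A-Za-z0-9]+", text) if part]

def best_key(name, keys, scheme="prefix"):
    """Return the longest key matching the data element name, or None if none match."""
    name_parts = components(name)
    best = None
    for key in keys:
        key_parts = components(key)
        n = len(key_parts)
        if n == 0 or n > len(name_parts):
            continue
        if scheme == "exact":
            matched = key == name
        elif scheme == "prefix":
            # The whole key must match the first whole component(s) of the name.
            matched = name_parts[:n] == key_parts
        elif scheme == "suffix":
            # The whole key must match the last whole component(s) of the name.
            matched = name_parts[-n:] == key_parts
        else:
            raise ValueError(f"unknown matching scheme: {scheme}")
        # When several keys match, the longest key wins.
        if matched and (best is None or len(key) > len(best)):
            best = key
    return best

full_component = best_key("Sample234-1 (mapped) (trimmed)", ["Sample234"])
partial_component = best_key("Sample234-1 (mapped) (trimmed)", ["Sample23"])
longest = best_key("ETC-005-1", ["ETC-005", "ETC-005-1"])
```

In this sketch `full_component` is `Sample234`, `partial_component` is `None` because `Sample23` covers only part of the first component, and `longest` is `ETC-005-1` because the longer of the two matching keys is preferred.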
To associate data elements with a particular row in the metadata table, select the desired row in
the metadata table by clicking on it. Then either click the Associate Data button at the bottom of
the Metadata Table view, or right-click on the selected metadata row and choose the Associate
Data with Row option (as seen in figure 12.13).
A window will open within which you can select the data elements that should have an association
with the metadata row.
If a selected data element already has an association with this particular metadata table, that
association will be updated. Associations with any other metadata tables will be left as they are.
Enter a role for the data elements that have been chosen and click Next until you can choose to
Save the outputs. Data associations and roles will be saved for the selected data elements.
• Click on the Find Associated Data button at the bottom of the view.
A table with a listing of the data elements associated to the selected metadata row(s) will
appear (figure 12.14).
The search results table shows the type, name, and navigation area path for each data element
found. It also shows the key entry of the metadata table row with which the element is associated
and the role of the data element for this metadata association. In figure 12.14, there are five
data elements associated with sample ETC-009. Three are Sequence Lists, two of which have a
role that tells us that they are unmapped reads resulting from the Map Reads to Reference tool.
Clicking the Refresh button will re-run the search and refresh the search results table.
Click the button labeled Close to close the search table view.
Data elements listed in the search result table can be opened by clicking on the button labeled
Show at the bottom of the view.
Alternatively, they can be highlighted in the Navigation Area by clicking the Find in Navigation
Area button.
Analyses can be launched on the selected data elements:
• Directly. Right click on one of the selected elements, choose the menu option Tools, and
navigate to the tool of interest. The data selected in the search results table will be listed
as selected elements in the launch wizard.
• Via the Navigation Area selection. Use the Find in Navigation Area button and then launch
a tool or workflow. The items that were selected in the Navigation Area will be pre-selected
in the launch wizard.
If no data elements with associations are found and this is unexpected, please re-index the
locations your data are stored in. This is described in section 3.4. For data held in a CLC Server
location, an administrator will need to run the re-indexing. Information on this can be found in
the CLC Server admin manual at
https://resources.qiagenbioinformatics.com/manuals/clcserver/current/admin/index.php?manual=CLC_Server_File_System_Location_indexes.html.
• Edit will allow you to change the role of the metadata association.
• Refresh will reload the metadata details from the Metadata Table; this functionality may
be used to attempt to re-fetch metadata that was previously unavailable, e.g. due to server
connectivity.
4. In the Metadata Elements table that opens, highlight the rows for the data elements the
metadata associations should be removed from.
5. Right-click over the highlighted area and choose the option Remove Association(s) (figure
12.16). Alternatively, use the Delete key on the keyboard, or on a Mac, the fn and
backspace keys at the same time.
Metadata associations can also be removed from within the Element info view for individual data
elements, as described in section 12.3.2.
When a metadata association is removed from a data element, this update to the data element
is automatically saved.
Figure 12.16: Removing metadata associations to two data elements via the Metadata Elements
table.
Figure 12.17: Click on the Edit Table... button to open a menu with options for adding, editing or
removing information in a CLC Metadata table.
Figure 12.18: Right-click on selected rows of a CLC Metadata Table to open a menu of actions that
can be taken.
Navigate between entries using the buttons on the right. Modifications made take effect as you
navigate to another row, or if you close the dialog using Done.
Right-click on an individual row in the table and select the Edit Entry... option to edit just
that entry. An option to delete rows is also in this menu: Delete Row(s) (figure 12.18).
Figure 12.19: Additional information can be imported to an existing CLC Metadata table. You can
choose whether new information should be added to existing entries, and whether rows should be
added for new entries. The columns to import can also be specified.
Individual rows can also be added using the ( ) button, which inserts a new row after the
current one.
Rows may be deleted using the ( ) button.
The ( ) and ( ) buttons are used to undo and redo changes respectively.
Figure 12.20: When adding a new column, a name, description and data type are specified. If it
should become the key column, the Key column box should be checked. Use the buttons on the
right to navigate to other columns or add further new columns.
Figure 12.21: The Name column has been designated as the key column.
• Description. An optional description of the information that will be held in the column. The
description will appear as a tool tip, visible when you hover the mouse cursor over the
column name in the metadata table.
• Key column. Any column containing only unique values can be designated as the key
column. If a table already has a key column, this option is disabled for other columns.
Information in the key column is used when automatically creating associations from data
elements, described in section 12.2.1.
• Type. The type of value allowed. The default data type for columns on import is text, but
this can be edited to the following types:
• The data element copies will have associations with the new copy of the metadata table.
The original elements keep their associations with the original metadata table.
• If a metadata table is copied but data elements with associations to it are not also copied
in that action, those data elements will be associated with both the copy and the original
metadata table.
• If data elements with associations to metadata are copied, but no metadata table is
involved in the same copy action, each data element copy will be associated to the same
metadata as the original element.
If a metadata table and some, but not all, data elements with associations to it, are copied in a
single action, then:
• The data element copies will have associations to the copy of the metadata table, while the
original elements (that were copied) remain associated with the original metadata table.
• Elements with associations to the original metadata table that were not copied will have
associations to both the original metadata table and the copy. However, if these data
elements are later copied (in a separate copy operation), those copies will only be
associated with the original metadata table. If they should be associated with the copy of
the metadata table, those associations must be added as described in section 12.2.
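The copy rules above can be summarized as a small model. The following Python sketch is illustrative only: the function copy_items, its data structures, and the "(copy)" naming are hypothetical and are not part of the Workbench. It models a single copy action and does not cover the caveat about elements that are copied later in a separate operation.

```python
def copy_items(associations, copied_tables, copied_elements):
    """Return the associations that exist after one copy action.

    associations: dict mapping element name -> set of metadata table names.
    copied_tables / copied_elements: names included in the copy action.
    Copies are named "<original> (copy)" purely for illustration.
    """
    def copy_name(name):
        return f"{name} (copy)"

    # Start from the existing associations; originals never lose any.
    result = {elem: set(tables) for elem, tables in associations.items()}
    for elem, tables in associations.items():
        if elem in copied_elements:
            # The element copy points at the table copy when the table was
            # copied in the same action, otherwise at the original table.
            result[copy_name(elem)] = {
                copy_name(t) if t in copied_tables else t for t in tables
            }
        else:
            # An element left out of the copy action gains an association
            # with the copy of each table it was already associated with.
            result[elem] |= {copy_name(t) for t in tables if t in copied_tables}
    return result


before = {"reads_A": {"Samples"}, "reads_B": {"Samples"}}
after = copy_items(before, copied_tables={"Samples"}, copied_elements={"reads_A"})
for element in sorted(after):
    print(element, "->", sorted(after[element]))
```

In this example, reads_A keeps its association with Samples, its copy is associated with the copy of Samples, and reads_B, which was not copied, ends up associated with both tables.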
Exporting metadata
The standard Workbench export functionality can be used to export metadata tables to various
formats. The system's default locale will be used for the export, which will affect the formatting
of numbers and dates in the exported file.
See section 8.1 for more information.
Chapter 13
Workflows
Contents
13.1 Creating and editing workflows
    13.1.1 Adding elements to a workflow
    13.1.2 Connecting workflow elements
    13.1.3 Ordering inputs
    13.1.4 Validating a workflow
    13.1.5 Viewing the flow of elements in a workflow
    13.1.6 Adjusting the workflow layout
    13.1.7 The Configuration Editor view
    13.1.8 Snippets in workflows
    13.1.9 Customizing the Workflow Editor
13.2 Workflow elements
    13.2.1 Anatomy of workflow elements
    13.2.2 Basic configuration of workflow elements
    13.2.3 Configuring Workflow Input elements
    13.2.4 Configuring Workflow Output and Export elements
    13.2.5 Control flow elements
    13.2.6 Input modifying tools
13.3 Launching workflows individually and in batches
    13.3.1 Workflow Result Metadata tables
    13.3.2 Running workflows in batch mode
    13.3.3 Running part of a workflow multiple times
13.4 Advanced workflow batching
    13.4.1 Batching workflows with more than one input changing per run
    13.4.2 Multiple levels of batching
13.5 Template workflows
    13.5.1 Trim and Map Sanger Sequences
13.6 Managing workflows
    13.6.1 Updating workflows
    13.6.2 Workflow installation
    13.6.3 Using workflow installation files
CHAPTER 13. WORKFLOWS 194
Figure 13.1: A workflow open in the Workflow Editor. Workflows consist of connected tools, where
the output of one tool is used as input for another tool.
Examples in this chapter use tools from the CLC Genomics Workbench. Some of these are not
available in the CLC Main Workbench. However, the principles described apply equally to tools in
CLC Main Workbench.
Figure 13.2: Template workflows are available under the Workflows menu.
To copy the image of a workflow design, select the elements in the workflow design (click in the
workflow editor and then press keys Ctrl + A), then copy (Ctrl + C), and then paste where you
wish the image to be placed, for example, in an email or presentation program.
• Drag tools from the Tools tab in the Toolbox panel, at the bottom left of the Workbench,
into the canvas area of the Workflow Editor, or
• Use the Add Element dialog (figure 13.3). The following methods can be used to open this
dialog:
Click on the Add Element ( ) button at the bottom of the Workflow Editor.
Right-click on an empty area of the canvas and select the Add Element ( ) option.
Use the keyboard shortcut Shift + Alt + E.
Select one or more elements and click on OK. Multiple elements can be selected by keeping
the Ctrl key (⌘ on Mac) depressed while selecting them.
• Use one of the relevant options offered when right-clicking on an input or output channel of
a workflow element, as shown in figure 13.4 and figure 13.5.
Figure 13.4: Connection options are shown in menus when you right click on an input or output
channel of a workflow element.
Figure 13.5: Right clicking on an output channel brings up a menu with relevant connection
options.
Once added, workflow elements can be moved around on the canvas using the 4 arrows icon
( ) that appears when hovering on an element.
Workflow elements can be removed by selecting them and pressing the Delete key, or by
right-clicking on the element name and choosing Remove from the context specific menu, as shown
in figure 13.6.
Figure 13.6: Right clicking on an element name brings up a context specific menu that includes
options for renaming or removing elements.
• Click on an output channel and, keeping the mouse button depressed, drag the cursor to
the desired input channel. A green border around the input channel name indicates when
the connection has been made and the mouse button can be released. An arrow is drawn,
linking the channels (figure 13.8).
Figure 13.7: In this workflow, two elements are supplying data to the Reads input channel of the
Map Reads to Reference element, while data from the Reads Track output channel of Map Reads
to Reference is being used as input to two elements.
Figure 13.8: Connecting the "Reads Track" output channel from a Map Reads to Reference element
to the "Read Mapping or Reads" input channel of a Local Realignment element.
• Use the Connect <channel name> to... option in the right-click menu of an output or input
channel. Hover the cursor over this option to see a list of elements in the workflow with
compatible channels. Hovering the cursor over any of these items then shows the particular
channels that can be connected to (figure 13.9).
Information about what elements and channels are connected
In a small workflow, it is easy
to see which elements are connected and how they are connected. In large workflows, the
following methods can be helpful:
• Mouse-over the connection line. A tooltip is revealed showing the elements and channels
that are connected (figure 13.10).
Figure 13.9: Right-clicking on an output channel displays a context specific menu, with options
supporting the connection of this channel to input channels of other workflow elements.
• Right-click on a connection line and choose the option Jump to Source to see the upstream
element or Jump to Destination to see the downstream element (figure 13.11).
Figure 13.10: Hover the mouse cursor over a connection to reveal a tooltip containing the names
of the elements and channels connected.
Figure 13.11: Right-click on a connection to reveal options to jump to the source element or the
destination element of that connection.
Removing connections
To remove a connection, right-click on the connection and select the Remove option (figure 13.11).
1. Improve the user experience by changing the order in which launch wizard steps are
presented.
By default, the order of the launch wizard steps reflects the order that Input elements were
added to the workflow when it was created.
To make changes to this order, right-click on an empty area of the canvas and choose Order
Workflow Inputs... from the menu that appears.
An Order Inputs dialog appears (figure 13.12). Select an input and move it up or down in
the list by clicking on the up arrow or down arrow, respectively.
A number next to an Input element's name indicates its position in the order. These
numbers are updated when the ordering is updated.
These same numbers can be used in Output element naming patterns (see section 13.2.4).
If output names using such patterns have already been configured, they may need to be
updated.
2. Influence the content of outputs in cases where input processing order has an effect.
For example, the order of the sections in a report generated by Combine Reports reflects
the order that inputs to that tool are processed.
By default, the main inputs to a tool are processed in the order that the connections to that
input channel were added when the workflow was created.
To make changes to this order, right-click on the relevant input channel and choose the
option Order Inputs... from the menu that appears.
An Order Inputs dialog appears (figure 13.12). Select an input and move it up or down in
the list by clicking on the up arrow or down arrow, respectively.
Numbers on the connection arrows are added or updated if any changes are made in this
Order Inputs... dialog.
Figure 13.12: The order of inputs is displayed and can be updated in an Order Inputs dialog.
• There must be at least one Input element connected to the main input channel of the
element where data starts its flow through the workflow. Where there are multiple
independent arms in the workflow, this requirement pertains to each of those arms.
• There must be at least one result saved from the end of each branch within a workflow. In
practice this means that at least one Output or Export element must be connected to each
terminal element with an output channel.
• All elements must have at least one connection to another element in the workflow.
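The three requirements above can be expressed as simple checks on a graph of elements and connections. The Python sketch below is a simplified illustration, not the Workbench's validation code: the validate function and its data layout are hypothetical, channels are not modelled (so main and parameter input channels are not distinguished), and every analysis element is assumed to have an output channel.

```python
def validate(elements, connections):
    """Return a list of problems; an empty list means the checks passed.

    elements: dict mapping element name -> kind, one of
              "input", "tool", "output" or "export".
    connections: list of (source, destination) element-name pairs.
    """
    problems = []
    upstream = {name: [] for name in elements}
    downstream = {name: [] for name in elements}
    for src, dst in connections:
        downstream[src].append(dst)
        upstream[dst].append(src)

    for name, kind in elements.items():
        # Every element needs at least one connection.
        if not upstream[name] and not downstream[name]:
            problems.append(f"{name}: not connected to any other element")
            continue
        if kind != "tool":
            continue
        # A tool with nothing upstream has no Input element supplying data.
        if not upstream[name]:
            problems.append(f"{name}: no Input element supplies data")
        # A terminal tool must have its result saved or exported.
        feeds_tool = any(elements[d] == "tool" for d in downstream[name])
        saved = any(elements[d] in ("output", "export") for d in downstream[name])
        if not feeds_tool and not saved:
            problems.append(f"{name}: terminal element without Output or Export")
    return problems


elements = {
    "Reads": "input",
    "Trim Reads": "tool",
    "Map Reads to Reference": "tool",
    "Reads Track": "output",
}
connections = [
    ("Reads", "Trim Reads"),
    ("Trim Reads", "Map Reads to Reference"),
    ("Map Reads to Reference", "Reads Track"),
]
print(validate(elements, connections))
```

Removing the last connection from this example would leave Map Reads to Reference as a terminal element with no saved result, and the Reads Track element unconnected, so both would be reported.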
Validation status is continuously monitored, with messages relating to this reported at the bottom
of the editor.
The validation status of a workflow will fall into one of three categories:
1. Valid and saved. When a workflow is valid and has been saved, the message "Validation
successful" is displayed in green text at the bottom of the editor (figure 13.13).
Figure 13.13: The "Validation successful" message indicates that this workflow is valid and has
been saved.
2. Valid, with changes not yet saved. When a workflow is valid but there are unsaved changes,
a single message is displayed at the bottom of the editor saying "The workflow must be
saved". The unsaved state is also represented by the asterisk in the name of the tab
(figure 13.14).
Valid workflows can be run before they are saved, allowing changes to be tested before
overwriting any previously saved version.
The Installation... button is enabled when a workflow is valid and has been saved. See
section 13.6.2 for information about workflow installation.
3. Invalid. Each problem in a workflow is reported at the bottom of the editor (figure 13.15).
Clicking on a message about a specific element redirects the focus within the editor to that
element (figure 13.16).
Figure 13.14: This workflow has changes that have not yet been saved, as indicated by the
message at the bottom of the editor and the asterisk beside the workflow name in the tab at the
top.
Figure 13.15: Problems are reported at the bottom of the workflow editor.
Figure 13.16: Clicking on the error message about Filter against Known Variants at the bottom of
the editor moved the focus in the editor to that element.
Figure 13.17: All elements connected downstream of a selected element are highlighted after
selecting the Highlight Subsequent Path menu option.
• Manually: Select one or more workflow elements and then, with the left mouse button
depressed, drag these elements to where you want them to be on the canvas.
• Automatically: Right-click anywhere on the canvas and choose the option "Layout" (fig-
ure 13.18), or use the quick command Shift + Alt + L. The layout of all connected elements
in the workflow will be adjusted.
See also section 13.1.9 for information about the Auto Layout setting. When enabled that setting
causes the layout to be adjusted automatically every time an element is added and connected.
Figure 13.18: The alignment of workflow elements can be improved using the "Layout" function.
Figure 13.19: Use the Configuration Editor to edit configurable parameters for all the tools in a
given Workflow.
• View. Opens a dialog showing the snippet, which allows you to see the structure
If you right-click on the top-level folder you get the options shown in figure 13.24:
• Create new group. Creates a new folder under the selected folder.
• Remove group. Removes the selected group (not available for the top-level folder).
• Rename group. Renames the selected group (not available for the top-level folder).
In the Side Panel, snippets can be dragged and dropped between groups to rearrange and order
them as desired. An exported snippet can be installed either by clicking on the 'Install from file'
button or by dragging and dropping the exported file directly into the folder where it should be
installed.
Figure 13.20: The selected elements are highlighted with a red box in this figure. Select "Install as
snippet".
Add a snippet to a workflow
Snippets can be added to a workflow in two ways: by dragging and dropping the snippet from
the Side Panel into the workflow editor, or by using the "Add element" option shown in
figure 13.25.
Figure 13.21: In the "Create a new snippet" dialog you can name the snippet and select whether
or not you would like to include the configuration. In the right-hand side of the dialog you can see
the elements that are included in the snippet.
Figure 13.22: When a snippet is installed, it appears in the Side Panel under the "Snippets" tab.
Figure 13.24: Right-clicking on the snippet top-level folder makes it possible to manipulate the
groups.
Figure 13.25: Snippets can be added to a workflow in the workflow editor using the 'Add Element'
button found in the lower left corner.
Minimap A zoomed-out overview of the workflow. The darker grey box in the minimap highlights
the area of the workflow visible in the editor. Drag that box within the minimap to quickly
navigate to a specific area in the editor. The location of this dark grey box is updated when
you navigate to another area of the workflow.
Figure 13.27: Two elements with names including the term "venn" were found using the Find tool
in the side panel. Both are visible in this view, with the first element found highlighted.
Grid Customize the spacing, style and color of the symbols used in the grid on the canvas, or
choose not to display a grid. Workflow elements snap to the grid when they are added or
moved around.
View mode Settings under the View tab are particularly useful when working with large workflows,
as they can be used to remove aspects of the design that are not of immediate interest.
• Collapsed. Enable this to hide the input and output channels of workflow elements
(figure 13.28).
Figure 13.28: The same workflow as above but with the "Collapse" option in the View mode settings
enabled.
• Highlight used elements. Enabling this option causes elements without at least
one input and one output connection to appear faded. Elements connected to such
elements are also faded (figure 13.29). (Shortcut: Alt + Shift + U)
Figure 13.29: A similar workflow to those above but with the "Highlight used elements" option in
the View mode settings enabled. The faded coloring makes it easy to spot that the workflow arm
starting with Differential Expression for RNA-Seq is not connected to the rest of the workflow.
• Rulers. Adds rulers along the left vertical and top horizontal edges of the canvas.
• Auto Layout. Enable this option to adjust the layout automatically every time an element
is added and connected. Depending on the workflow design, using the "Layout" option
in the right-click menu over the canvas can be preferable (see section 13.1.6).
• Connections to background. Enable this to put connection lines behind workflow
elements (figure 13.30).
See also the Design options, described below, where you can change the color and
design of connections.
Figure 13.30: A similar workflow to those above but with the "Connections to background" option
in the View mode settings enabled.
Design Options under the Design tab allow the shapes and colors of elements and connections to
be defined. Of particular note is the ability to color elements with non-default configurations
differently to those with default settings.
Figure 13.31: A similar workflow to those above, but where standard elements with non-default
configuration have been assigned the color pink and control flow elements with non-default
configurations have been assigned a pale green color, making them easy to spot.
Snippets Snippets are sections of workflows, which have been saved and can be easily added
to a new workflow. These are described in section 13.1.8.
Light green Input elements. Elements for taking in the data to be analyzed.
Dark blue Output and Export elements. Elements that indicate data should be saved to disk,
either in a CLC location (Output elements) or any other accessible file location (Export
elements).
Light grey An analysis element where the default values are used for all settings.
Purple A configured analysis element, i.e. one or more values in that element have been changed
from the defaults.
Forest green Configured control flow elements, i.e. one or more values in that element have
been changed from the defaults.
Background colors can be changed under the Design tab in the side panel settings of the
Workflow editor.
The name of a new element added to a workflow is shown in red text until it is properly connected
to other elements.
Configuring Input and Output elements is described in section 13.2.3 and section 13.2.4.
Control flow elements, used to fine tune control of the execution of whole workflows or sections
of workflows, are described in section 13.2.5.
Figure 13.33: An element's color indicates the role it plays and its state. Here, Trim Reads
is using only default parameter values, whereas the purple background for Map Reads to Reference
indicates that one or more of its parameter values have been changed. The green elements are
Input elements. The blue color of the InDels element and parts of the Export PDF element indicate
that data sent to these elements will be saved to disk.
Figure 13.34: A workflow before (left) and after (right) the Map Reads to Reference element was
renamed. In the linked Workflow Configuration view at the bottom right, both the original and
updated element names are listed.
• Right-clicking on an element name and choosing the Configure... option from the menu that
appears.
Options can also be edited in the Workflow Configuration view (figure 13.34).
Workflow element customization can include:
Figure 13.35: The Workflow view (top) and Workflow Configuration view (bottom) have been opened
as linked views. The Map Reads to Reference element has been opened for configuration in
the Workflow view and the Masking mode and Masking track options have been unlocked. They
will correspondingly appear unlocked in the Workflow Configuration view after the Finish button is
clicked.
Figure 13.36: A workflow launch wizard step showing the configurable (unlocked) options at the
top, with a heading for the locked settings (top). Clicking on the Locked Settings heading reveals a
list of the locked options and their values (bottom)
Figure 13.37: An option originally called "Match score" has been renamed "Score for each match
position" in the element configuration dialog. It has also been unlocked so the value for this option
will be configurable when launching the workflow.
Note: Clicking on the Reset button in a workflow element configuration dialog will reset all
changes in that particular configuration step to the defaults, including any updated option
names.
import. The workflow author can limit these options (figure 13.38), as well as configure import
options with non-default values. Settings can be locked if they should not be configurable when
launching the workflow (see Basic configuration of workflow elements (section 13.2.2)).
On-the-fly import options that can be configured in Input elements are:
• Allow any compatible importer. All compatible importers will be available when launching
the workflow and all the options for each importer will be configurable.
• Allow selected importers. Specify particular importers to be available when launching the
workflow. Options for each selected importer can be configured by clicking on the Configure
Parameters button.
Note: To specify CLC data not stored in a CLC location as input to a workflow, on-the-fly import
must be allowed, and CLC Format must be one of the allowed importers. See also Launching
workflows individually and in batches (section 13.3).
Figure 13.38: The allowed data sources are configured in Input elements. By default, both
checkboxes in the Advanced section are enabled, allowing data to be selected from a CLC location
(input from the Navigation Area), or from another location (on-the-fly import). For on-the-fly import,
any available importer can be used by default. When only specific importers are allowed, those
importers can be configured by selecting each in turn and clicking on the "Configure Parameters"
button. In this case, only the data types supported by the specified importers can be selected as
input for on-the-fly import when launching the workflow.
The "Workflow role" field visible when configuring Workflow Input elements connected to
parameter input channels is relevant when working with the CLC Genomics Workbench.
Further information about this can be found at
https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=QIAGEN_Sets.html
and
https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Custom_Sets.html
Figure 13.39: Using this workflow, data imported on-the-fly would be saved as an output from the
Save On-the-Fly Imports element.
A Save On-the-Fly Imports element is not needed when an Iterate element is connected to the
Input element being used for on-the-fly import. In this situation, an Output element can be
connected directly to the Iterate element (figure 13.40).
Figure 13.40: When an Iterate element is connected to an Input element, data imported on-the-fly
can be saved by connecting an Output element to the Iterate element.
Results generated by a workflow are only saved if the relevant output channel of a
workflow element is connected to a Workflow Output element or an Export element. Data
sent to output channels without an Output or Export element attached are not saved.
Terminal workflow elements with output channels must have at least one Workflow Output
element or Export element connected.
tion 8.1.3. Other settings relating to export, relevant both for exports run directly or in a workflow
context, are described in section 8.1.2.
Figure 13.41: Defining the name to assign to an output from a workflow. The default naming
pattern for Output elements uses the placeholder {1}, which is a synonym for the placeholder
{name}.
Figure 13.42: Hover the mouse cursor over the field where a custom name can be configured to
reveal a tooltip with a list of available placeholders.
• {name} or {1} The default name for that output from that tool, i.e. the name that would be
used if the tool was run outside a workflow context.
• {input} or {2} The name of the primary workflow input(s) for the path of the workflow being
traversed.
"Primary workflow input" generally refers to the data being analyzed, i.e. inputs expecting
sample data, as opposed to inputs expecting reference data.
For a workflow with multiple primary inputs to an arm of the workflow, {input}, or its
equivalent {2}, would result in the name of each of these primary inputs being included in
the names of the outputs from that workflow arm (figure 13.43).
Figure 13.43: Top: A contrived workflow with two primary inputs (green boxes). The QC for
Sequencing Reads step receives data only from the first input, "Reads to Quality Trim". The Map
Reads to Reference step receives data originating from both primary inputs. Bottom: The effect
of different naming patterns on result names when a sequence list called "sample1" was supplied
for the first input and a sequence list called "sample2" was supplied for the second input. The first
row shows the Output elements and results using the default naming pattern, {1}. The middle row
shows the Output elements and results when the naming pattern included the placeholder {2}, and
the last row shows them when the naming pattern included the placeholder {2:1}.
• {input:N} or {2:N} The name of the Nth input to the workflow. E.g. {2:1} specifies the first
input to the workflow, while {2:2} specifies the second input (figure 13.43).
Unlike the general form described above, i.e. {input} or {2}, reference data inputs can be
included in names using this placeholder form (figure 13.44).
For a workflow with only one primary input, {input} or {2} is equivalent to the more specific
form {input:1} or {2:1}.
For workflows containing control flow elements, the specific placeholder form, {2:N}, is
recommended.
See section 13.1.3 for information about workflow input ordering, and section 13.2.5 for
information about control flow elements.
• {metadata} or {3} The batch unit identifier for workflows executed in batch mode. Depending
on how the workflow was configured at launch, this value may be obtained from metadata.
Figure 13.44: Top: A contrived workflow with two primary inputs and a reference data input (green
boxes). Bottom: The names of the results generated in a given workflow run. The naming pattern
for the Reads Track output includes {2}, which adds the names of all primary inputs to that analysis
step, (sample1, sample2), and {2:3}, which adds the name of the third input, whatever the role
that input has. In this case, it is a reference data input and an element called Escherichia coli
(ASM584v2) was supplied.
For workflows not executed in batch mode or without Iterate elements, the value will be
identical to that substituted using {input} or {2}.
Note: For workflows containing control flow elements, the more specific form of placeholder,
i.e. the {metadata:columnname} or {3:columnname} form, described below, is
recommended.
• {metadata:columnname} or {3:columnname} The value for the batch unit in the column
named "columnname" of the metadata selected when launching the workflow. Pertinent
for workflows executed in batch mode or workflows that contain Iterate elements. If a
column of this name is not found, or a metadata table was not provided when launching
the workflow, then the value will be identical to that substituted using {input} or {2}.
• {year}, {month}, {day}, {hour}, {minute}, and {second} Timestamp information based on
the time an output is created. Using these placeholders, items generated by a workflow at
different times can have different file names.
In addition to the placeholders above, the placeholder {extension} is available for exported file
names. This is replaced by the default file extension for the exported file's format, e.g. .pdf, .txt.
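The substitution rules above can be illustrated with a small Python sketch. This is not how the Workbench implements naming. The function name expand_name, the zero-padded timestamp fields, and the decision to leave indexed forms such as {input:2} untouched are assumptions made for illustration only.

```python
import re
from datetime import datetime

ALIASES = {"2": "input", "3": "metadata"}  # numeric forms named in the text


def expand_name(pattern, values, metadata_row=None, created=None):
    """Expand {placeholder} terms in a naming pattern (illustrative only)."""
    created = created or datetime.now()
    # Zero-padded timestamp fields are an assumption of this sketch.
    stamps = {
        "year": f"{created.year:04d}", "month": f"{created.month:02d}",
        "day": f"{created.day:02d}", "hour": f"{created.hour:02d}",
        "minute": f"{created.minute:02d}", "second": f"{created.second:02d}",
    }

    def substitute(match):
        key, _, arg = match.group(1).partition(":")
        key = ALIASES.get(key, key)
        if key == "metadata" and arg:
            if metadata_row and arg in metadata_row:
                return str(metadata_row[arg])
            key, arg = "input", ""  # column or table missing: use the {input} value
        if arg:
            return match.group(0)  # indexed forms such as {input:2} are left untouched
        if key in stamps:
            return stamps[key]
        return str(values.get(key, match.group(0)))

    return re.sub(r"\{([^{}]+)\}", substitute, pattern)
```

For instance, with a hypothetical metadata row containing a "Type" column, the pattern {input}_{metadata:Type}{extension} would expand using the column value, while {3:Missing} would fall back to the value used for {input}.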
For example, with an Output element configured with /variants/{name}, the resulting output
would be saved to a subfolder called variants, placed within the folder selected for outputs
when the workflow is launched. If a specified subfolder does not already exist, it is created when
the outputs are saved.
When defining subfolders for outputs or exported files, terms between all forward slash characters
are interpreted as subfolders. For example, a name like /variants/level2/level3/myoutput
would put the data item called myoutput into a folder called level3 within a folder called
level2, which itself is inside a folder called variants. The variants folder would be placed
under the location selected for storing the workflow outputs.
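The way forward slashes translate into nested folders can be sketched as follows. This is a simplified illustration, not the Workbench's own logic. The function name output_location and the example output folder /data/results are hypothetical.

```python
from pathlib import PurePosixPath


def output_location(output_root, name_pattern):
    """Split an output name into its subfolder path and the final item name."""
    # Terms between forward slashes are subfolders; the last term is the item name.
    parts = [p for p in name_pattern.split("/") if p]
    *subfolders, item_name = parts
    folder = PurePosixPath(output_root).joinpath(*subfolders)
    return folder, item_name


folder, item = output_location("/data/results", "/variants/level2/level3/myoutput")
```

Here folder is the path variants/level2/level3 under the chosen output location, and item is myoutput.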
1. Controlling how data is grouped for analysis. These include Iterate and Collect and
Distribute, described in section 13.2.5.
2. Controlling the flow of the workflow based on its configuration when launched. These
include Fork, described in section 13.2.5, and Save On-the-Fly Imports, described in
section 13.2.3.
3. Controlling the flow through the workflow based on aspects of the data. There are several
such branching elements, described in section 13.2.5.
• Iterate elements are placed at the top of a branch of a workflow that should be run multiple
times, using different inputs in each run. The sets of data to use in each run are referred
to as "batch units" or, sometimes, "iteration units".
• Collect and Distribute elements are, optionally, placed downstream of an Iterate element,
where they collect outputs from the upstream iteration block (see below) and distribute
them as inputs to downstream analyses.
Figure 13.45: Control flow elements are found under the Control Flow folder in the Add Elements
wizard.
The RNA-Seq and Differential Gene Expression Analysis template workflow, distributed with the
CLC Genomics Workbench (https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=RNA_Seq_Differential_Gene_Expression_Analysis_workflow.html),
is an example of a workflow that includes each of these control flow elements.
The steps between an Iterate element and a Collect and Distribute element are referred to as
an "iteration block". The workflow in figure 13.46 contains a single iteration block (shaded in
turquoise), where steps within that block are run once per batch unit. The Collect and Distribute
element (renamed to Collect Expressions) collects all the results from the iteration block and
sends them as input to the next stage of the analysis (shaded in purple).
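As a rough mental model, an iteration block behaves like a loop over batch units, and the Collect and Distribute element behaves like gathering the loop's results for a single downstream call. The Python sketch below illustrates only that idea. The function names and the sample data are hypothetical, and the stand-in analysis functions do no real work.

```python
def run_workflow(batch_units, per_unit_step, combined_step):
    """Run per_unit_step once per batch unit, then combined_step once on all results."""
    collected = []
    for unit_id, inputs in batch_units.items():  # Iterate: one run per batch unit
        collected.append(per_unit_step(unit_id, inputs))
    return combined_step(collected)  # Collect and Distribute: one downstream run


batch_units = {"sample1": ["s1_R1", "s1_R2"], "sample2": ["s2_R1", "s2_R2"]}
calls = []


def fake_rnaseq(unit_id, reads):
    # Stand-in for an analysis step inside the iteration block.
    calls.append(unit_id)
    return f"expression({unit_id})"


result = run_workflow(batch_units, fake_rnaseq, lambda tracks: sorted(tracks))
```

The per-unit step is called twice, once per sample, while the combined step receives both results in a single call.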
Figure 13.46: The roles of the Iterate and Collect and Distribute control flow elements are
highlighted in the context of RNA-Seq and differential expression analyses. RNA-Seq Analysis lies
downstream of an Iterate element, within an iteration block (shaded in turquoise). It will thus be run
once per batch unit. Differential Expression for RNA-Seq lies immediately downstream of a Collect
and Distribute element (renamed to Collect Expressions), and is sent all the expression results from
the iteration block as input for a single analysis.
Iterate element names are included in the workflow launch wizard in the following steps:
• Configure batching: The names of Iterate elements are provided in association with the
drop-down list of column names in the metadata provided. A meaningful Iterate element
name can thus help guide the choice of relevant metadata to group the inputs into batch
units (figure 13.47).
• Batch overview: There is a column for each Iterate element (figure 13.48). Meaningful
names can thus make it easier to review batch unit organization critically when launching
the workflow.
Figure 13.47: The two Iterate elements in this workflow (right) have been renamed. Their names
are included in the "Configure batching" wizard step in the launch wizard (left).
Figure 13.48: The batch overview for a workflow with two Iterate elements. The names assigned to
the two columns containing the batch unit organization are the names of the corresponding Iterate
elements.
1. Number of coupled inputs The number of separate inputs for each given iteration. These
inputs are "coupled" in the sense that, for a given iteration, particular inputs are used
together. For example, this is useful when sets of sample reads should be mapped in the same
way, but each set should be mapped to a particular reference (figure 13.50).
2. Error handling Specify what should happen if an error is encountered. The default is that
the workflow should stop on any error. The alternative is to continue running the workflow
if possible, potentially allowing later batches to be analyzed even if an earlier one fails.
3. Metadata table columns If the workflow is always run with metadata tables that have the
same column structure, then it can be useful to set the value of the column titles here, so
the workflow wizard will preselect them. The column titles must be specified in the same
order as shown in the workflow wizard when running the workflow. Locking this parameter
to a fixed value (i.e. not blank) will require the definition of batch units to be based on
metadata. Locking this parameter to a blank value requires the definition of batch units to
be based on the organization of input data (and not metadata).
4. Primary input If the number of coupled inputs is two or more, then the primary input (used
to define the batch units) can be configured using this parameter.
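The difference between the two error handling choices can be sketched in Python. This is an illustration of the behavior described above, not the Workbench's implementation. The function name run_batches and the stop_on_error flag are hypothetical.

```python
def run_batches(batch_units, analyze, stop_on_error=True):
    """Analyze each batch unit, either stopping at the first error or continuing."""
    results, failures = {}, {}
    for unit_id, inputs in batch_units.items():
        try:
            results[unit_id] = analyze(inputs)
        except Exception as err:
            if stop_on_error:
                raise  # default: the whole run stops on any error
            failures[unit_id] = str(err)  # alternative: record it and move on
    return results, failures
```

With stop_on_error set to False, later batch units are still analyzed when an earlier one fails, and the failures are reported alongside the successful results.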
Figure 13.49: The number of coupled inputs in this simple example is 2, allowing each set of
sample reads to be mapped to a particular reference, rather than using the same reference for all
iterations.
Figure 13.50: Reads can be mapped to specified contigs due to the 2 input channels of the Iterate
element. Using this design, a single sequence list containing all the unmapped reads from all the
initial inputs is generated. That would not be possible without the inclusion of the Iterate and
Collect and Distribute elements.
Figure 13.51: A comma separated list of terms in the Outputs field of the Collect and Distribute
element defines the number of output channels and their names.
Figure 13.52: In this workflow, each case sample is analyzed against all of the control samples.
Figure 13.53: Contents of the metadata column "Type" define which samples are cases and which
are controls. Iteration units are defined by the contents of the "ID" column.
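The design in figures 13.52 and 13.53 can be pictured as pairing each case with the full list of controls. The Python sketch below assumes the "Type" column holds the literal values "case" and "control", which is an assumption, as are the sample IDs and the function name.

```python
def case_control_units(rows, id_column="ID", type_column="Type"):
    """Pair each case with the full set of controls (illustrative sketch)."""
    controls = [r[id_column] for r in rows if r[type_column] == "control"]
    return {r[id_column]: list(controls) for r in rows if r[type_column] == "case"}


rows = [
    {"ID": "S1", "Type": "case"},
    {"ID": "S2", "Type": "case"},
    {"ID": "C1", "Type": "control"},
    {"ID": "C2", "Type": "control"},
]
units = case_control_units(rows)
```

Each key in units is one iteration unit (a case), and its value is the set of controls it is analyzed against.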
Fork
A choice between running particular parts of an analysis can be offered by including one or more
Fork ( ) elements in a workflow.
Including Fork elements in workflows can also help decrease the number of workflows that need
to be maintained, as multiple analysis paths can be included in a single workflow, with only one
or some of those paths being taken when the workflow is run.
For example, when the workflow shown in figure 13.54 is launched, a choice between "Quality"
and "Quality and Vector" is offered in the launch wizard. Choosing "Quality" means the data will
flow down the path containing the "Trim on Quality" element, while choosing "Quality and Vector"
means the data will flow down the path containing the "Trim Vector and on Quality" element.
Figure 13.54: A simple workflow with a Fork element. When launched, a choice is offered in the
launch wizard for which path the analysis should follow.
• The name of the Fork element The element name is used in the launch wizard to describe
the choice being made.
• The configurable fields in the Fork element The configurable fields are used to define the
available options and, if desired, specify a default:
Path names The list of choices of downstream paths from this Fork element, entered
as a comma-delimited text list. These are presented in a drop-down list in the launch
wizard. An output channel is added to the Fork element for each path name.
Selected path One of the configured path names, which then will be used as the
default. If this field is left blank, the first of the paths in the "Path names" list will be
pre-selected in the wizard when launching the workflow from a CLC Workbench.
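How the two Fork fields combine can be sketched in a few lines of Python. This is only an illustration of the rules stated above. The function name fork_choices and the error raised for an unknown default are assumptions.

```python
def fork_choices(path_names, selected_path=""):
    """Return the list of path options and the one pre-selected at launch."""
    options = [p.strip() for p in path_names.split(",") if p.strip()]
    # A blank "Selected path" means the first path name is pre-selected.
    default = selected_path.strip() or options[0]
    if default not in options:
        raise ValueError(f"{default!r} is not one of the configured path names")
    return options, default


options, default = fork_choices("Quality, Quality and Vector")
```

With the "Selected path" field left blank, default is "Quality", the first entry in the list.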
Figure 13.55: A Fork element renamed as "Trim for" is open for configuration. The wizard step
seen when launching a workflow with this Fork element is shown in figure 13.56.
Figure 13.56: This workflow has a single Fork element, renamed as "Trim for". The path names,
"Quality" and "Quality and Vector", (see figure 13.55) are listed in the "Specify workflow path" launch
wizard step (top right).
• Providing the choice of following one of several possible analysis paths (figure 13.56).
• Providing the choice of following one of several possible analysis paths or following multiple
analysis paths (figure 13.57).
• Providing the choice of whether or not to run a particular part of the analysis (figure 13.58).
When a workflow contains multiple Fork elements, all the corresponding choices are presented
in a single "Specify workflow paths" launch wizard step (figure 13.58).
Figure 13.57: When launching this workflow, the choice can be made to trim sequences based on
quality, trim for vector sequence, or trim for both.
Figure 13.58: There are two Fork elements in this workflow, and thus two choices will need to be
made when launching it. Both are presented in the "Specify workflow path" launch wizard step. The
Yes/No choice for the "Generate sequence statistics" option determines whether or not the Create
Sequence Statistics analysis will be run.
Branching elements
Branching elements control the path that data takes through a workflow.
The sequence list provided as input will flow through the Pass or the Fail output channel depending
on whether the number of sequences meets the condition specified in the branching element
(figure 13.59).
Figure 13.59: If the sequence list provided as input meets the condition specified in a Branch on
Sequence Count element, it will flow through the Pass output channel and be used in the Assemble
Sequences step. Otherwise, it will flow through the Fail output channel, where here, it would not be
processed further.
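The Pass/Fail routing described above amounts to a single comparison on the number of sequences. The Python sketch below illustrates this using the three operators the element offers (>=, = and <=). The function name and the threshold parameter name are hypothetical.

```python
import operator

# The three comparison operators offered by the element.
COMPARISONS = {">=": operator.ge, "=": operator.eq, "<=": operator.le}


def branch_on_sequence_count(sequences, comparison, threshold):
    """Return the output channel ("Pass" or "Fail") a sequence list would take."""
    passed = COMPARISONS[comparison](len(sequences), threshold)
    return "Pass" if passed else "Fail"


channel = branch_on_sequence_count(["ACGT", "GGCA", "TTAG"], ">=", 2)
```

A list of three sequences checked against the condition ">= 2" takes the Pass channel.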
In the Branch on Sequence Count configuration dialog (figure 13.60), the configuration options
are:
• Comparison The operator to use for the comparison: >=, = or <=, offered in a drop-down
list.
• Double click on the workflow name in the Workflows tab in the Toolbox panel, which is in
the bottom, left side of the Workbench.
• Select the workflow from the Workflows menu at the top of the Workbench.
Workflow inputs
Data to be used in a workflow analysis can be selected from the Navigation Area or can be
imported on-the-fly from files stored elsewhere. The specific options available depend on how the
workflow was configured by the author. Using on-the-fly import, the first action taken when the
workflow is run is to import the specified data.
When Select files for on-the-fly import is selected, the format of the data files must be specified
using the drop-down menu beside this option. If configuration options are available for the
selected importer, they will be shown in the lower part of the dialog.
If remote locations are available, such as CLC Server import/export directories, or AWS S3
buckets, a Location drop-down menu will be visible above the file selection area, either in the
launch wizard (figure 13.61) or in the Select files dialog that opens when a Browse button is
clicked on.
Note:
• To use CLC data stored in an AWS S3 bucket in a workflow analysis, you must choose the
option Select files for on-the-fly import and choose the format CLC Format.
• If you select data from an AWS S3 bucket for an analysis that will be run on your CLC
Workbench or CLC Server, the data will be downloaded from AWS before the analysis
begins. For large datasets, this may take some time. Downloading from AWS S3 to a local
file system may incur charges from AWS. See AWS S3 pricing.
Figure 13.61: Input data is specified when launching a workflow. CLC data can be selected from
the Navigation Area. Data stored elsewhere can be selected after choosing the option "Select files
for on-the-fly import" and specifying the format of that data.
For information about configuring workflow Input elements when creating or editing a workflow,
see section 13.2.3.
Workflow outputs
Output and Export elements in workflows specify the analysis results to be saved. In addition to
analysis results, a Workflow Result Metadata table can be output. This contains a record of the
workflow outputs, as described in section 13.3.1.
The history of data elements generated as workflow outputs contains the name and version of the
workflow that created them. When an installed workflow was used, the workflow build id is also
included (see section 2.5).
Figure 13.62: The final step when launching a workflow includes an option to create a workflow
result metadata table.
See section 12.3.1 for information on finding and working with data associated with metadata
rows.
Figure 13.63: The Workflow Result Metadata table, top left, was generated from a run of the
workflow on the right. Here, 4 RNA-Seq Analysis runs occurred within the iteration loop (between
the Iterate and the Collect and Distribute elements). Those results were then supplied to Differential
Expression in Two Groups, which was run once. There are thus 5 rows in the Workflow Result
Metadata table. The RNA-Seq Analysis results each have a batch identifier, while the statistical
comparison output does not.
• The Batch checkbox at the bottom of input steps in the launch wizard has been checked,
and/or
• The workflow contains one or more Iterate control flow elements. Steps downstream of
Iterate elements and upstream of Collect and Distribute elements, if present, are run
Footnote 2: There is one exception to this. Where batch units have been defined by the organization of the input data and
the outputs are to be saved in the same folders as the inputs, one workflow result metadata table is generated per
analysis.
A batch unit consists of the data that should be analyzed together. The grouping of data into
batch units is defined after the inputs for analysis have been selected.
• Where there is more than one level of batch units. This could be:
A workflow with more than one Input element, where the inputs to both of these
should be grouped into batch units. An example of such a workflow is described
in section 13.4.
A workflow containing more than one Iterate element.
A workflow containing an Iterate element that will be run in Batch mode. An
example of this is described in the "RNA-Seq and differential gene expression analysis"
tutorial, available from https://resources.qiagenbioinformatics.com/tutorials/RNASeq-
DGE-analysis.pdf.
• Where Iterate or Collect and Distribute elements in the workflow have been configured to
require metadata.
Note: When launching a workflow containing analysis steps that require metadata, the metadata
provided to define batch units is also used for those analysis steps. For example, in the RNA-Seq
and Differential Gene Expression Analysis template workflow, metadata provided to define batch
units is also used for the Differential Expression for RNA-Seq step.
There are two ways metadata defining batch units can be provided:
1. Using a CLC Metadata Table In this case, the data elements selected as inputs must
already have associations to this CLC Metadata Table.
If a CLC Metadata Table with data associated to it has been selected in the "Select Workflow
Input" step of a workflow, that table will be pre-selected in the "Configure batching" step
of the launch wizard. You can specify the column that batch units will be based on. Data
associated with the table rows for each unique value in that column make up the contents
of the batch units. The contents can be refined using the fields below the preview pane
(figure 13.64).
Outputs from the workflow that can be unambiguously identified with a single row of the
CLC Metadata Table will have an association to that row added. Outputs derived from two
or more inputs with different metadata associations will not have associations to this CLC
Metadata Table.
Figure 13.64: A CLC Metadata Table with data associated to it was selected as input to a workflow
being launched in Batch mode. In the Configure batching wizard step, the metadata source is
pre-configured. The column to base batch units on can be selected (top). The Batch overview
step shows the data elements in each batch unit. Here "trim" has been entered in the "Only use
elements containing" field, so only elements containing the term "trim" in their names are included
in the batch units (bottom).
2. Using an Excel, CSV or TSV format file. The metadata in the file is imported into the CLC
software at the start of the workflow run. Requirements for this file are:
Figure 13.65: Paired fastq files from two samples were selected for import (top). The Excel file with
information about this data set contains a header row and 4 rows of information, one row per input
file. The contents of the first column contain enough of each file name to uniquely identify each
input file. The second column contains sample IDs.
If there is a tool in the workflow that requires descriptive information, for example, factors
for statistical testing in Differential Expression for RNA-Seq, then the file should also
contain columns with this information.
For example, if a data element selected in the Navigation Area has the name
atp8a_1_sample1_day3, then the first column could contain that name in full, or just
enough of the first part of the name to uniquely identify it. This could be, for example,
atp8a_1_sample1. Similarly, if a data file selected for on-the-fly import is at:
C:\Users\username\My Data\atp8a_1_sample1_day3.clc, the first column of the Excel
spreadsheet could contain atp8a_1_sample1_day3.clc, or a prefix long enough to
uniquely identify the file, e.g. atp8a_1_sample1.
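The matching rule described above, where a first-column entry is either the full name or a leading part of it, can be sketched in Python as follows. This is an illustration, not the Workbench's matching code. The function name match_row is hypothetical, and treating an ambiguous match as an error is an assumption.

```python
from pathlib import PureWindowsPath


def match_row(element_name, first_column_values):
    """Return the first-column entry that identifies element_name."""
    # For file paths, only the file name is compared, not the folder part.
    name = PureWindowsPath(element_name).name
    hits = [value for value in first_column_values if name.startswith(value)]
    if len(hits) != 1:
        raise ValueError(f"{name!r} matched {len(hits)} metadata rows")
    return hits[0]


row_key = match_row(
    r"C:\Users\username\My Data\atp8a_1_sample1_day3.clc",
    ["atp8a_1_sample1", "atp8a_2_sample2"],
)
```

Here the file is identified by the entry atp8a_1_sample1, since that entry is a prefix of the file name and no other entry matches.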
sequence file in each batch unit. If a column containing fewer unique values was selected, one
or more batch units would consist of several files. This is illustrated in figure 13.66.
Figure 13.66: Batch units are defined according to the values in the SRR_ID column of the Excel
file that was selected.
In the next step, a preview of the batch units is shown. The workflow will be run once for each
entry in the left hand column, with the input data grouped as shown in the right hand column
(figure 13.67).
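The effect of choosing one column or another to define batch units can be sketched as a simple grouping operation. In the Python sketch below, the column names File and Tissue, the file names, and the function name are hypothetical. Only SRR_ID is taken from the figures.

```python
from collections import defaultdict


def batch_units(rows, column, name_column="File"):
    """Group input names by the value in the chosen metadata column."""
    units = defaultdict(list)
    for row in rows:
        units[row[column]].append(row[name_column])
    return dict(units)


rows = [
    {"File": "SRR001.fastq", "SRR_ID": "SRR001", "Tissue": "head"},
    {"File": "SRR002.fastq", "SRR_ID": "SRR002", "Tissue": "head"},
    {"File": "SRR003.fastq", "SRR_ID": "SRR003", "Tissue": "wing"},
    {"File": "SRR004.fastq", "SRR_ID": "SRR004", "Tissue": "wing"},
]
by_run = batch_units(rows, "SRR_ID")
by_tissue = batch_units(rows, "Tissue")
```

Grouping by SRR_ID gives four batch units of one file each, so four workflow runs. Grouping by Tissue gives two batch units of two files each, so two runs.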
Figure 13.67: The Batch overview step allows you to review the batch units. In the top image, a
column called SRR_ID had been selected, resulting in 8 batch units, so 8 workflow runs, with the
data from one input file to be used in each batch. In the lower image, a different column was
selected to define the batch units. There, the workflow would be run 3 times with the input data
grouped as shown.
directories per batch unit (figure 13.69). When that option is checked, files exported are placed
into separate subfolders under the export folder selected for each export step.
Figure 13.68: An Excel file at the top describes 4 Sanger files that make up two pairs of reads.
The "Sample Name" column was identified as the one indicating the group the file belongs to.
Information about the relevant sample appears in each row. At the Batch overview step, shown at
the bottom, you can check the batch units are as intended.
(figure 13.72).
Figure 13.69: Options are presented in the final wizard step for configuring where outputs and
exported files from each batch run should be saved.
Figure 13.70: The RNA-Seq analysis tool is run once per sample and a single combined report is
then generated for the full set of samples.
Figure 13.71: With the current selection in the wizard, the RNA-Seq Analysis tool will run 8 times,
once for each sample. The Combine Reports tool will run once.
Figure 13.72: The Iterate element can be renamed to change the text that is displayed in the
wizard when running the workflow.
• Grouping the data into different subsets to be analyzed together in particular sections of
a workflow. Groupings of data can be used in the following ways:
Different groupings of data are used as inputs to different sections of the same
workflow. For details, see section 13.3.3 and section 13.4.2.
Different workflow inputs follow different paths through parts of a workflow. Based
on metadata, samples can be distributed into groups to follow different analysis paths
in some workflow sections, at the same time as processing them individually and
identically through other sections of the same workflow.
Configuring Collect and Distribute elements is central to the design of this workflow.
This is described in section 13.2.5. Running such workflows is described in
section 13.3.3.
• Matching particular workflow inputs for each workflow run. Where more than one input
to a workflow changes per run, the particular input data to use for each run can be defined
using metadata. The simplest case is as described in section 13.4.1. However, more
complex scenarios, such as when intermediate results should be merged or parts of the
workflow should be run multiple times, can also be catered for using control flow elements
(see section 13.2.5).
Examples in this section make reference to CLC Genomics Workbench tools and data types
commonly analyzed using that software. However, the principles apply equally to workflows
created in the CLC Main Workbench.
13.4.1 Batching workflows with more than one input changing per run
When a workflow contains multiple Input elements (multiple light green boxes),
a Batch checkbox is available in the launch wizard for each Input element attached to a main
input channel.
Checking that box indicates that the data supplied for that input should change in each batch
run.
By contrast, if multiple elements are selected, and the Batch option is not checked, all elements
will be treated as a single set, to be used in a single analysis run.
Most commonly, one input is changed per run. For example, in a batch analysis involving read
mappings, usually each batch unit would include a different set of reads, but the same reference
sequence.
However, it is possible to have two or more inputs that are different in each batch unit. For
example, an analysis involving a read mapping, where each set of reads should be mapped to a
different reference sequence. In cases like this, batch units must be defined using metadata.
Figure 13.73 shows an example where the aim is to do just this. The workflow contains a
Map Reads to Contigs element and two workflow input elements, Sample Reads and Reference
Sequences. The information to define the batch units is provided by two Excel files, one
containing information about the Sample Reads input and the other with information about the
Reference Sequences input. The contents of files that would work for this example are shown in
figure 13.74.
Of particular note are:
• The first column of each file contains the exact file names for all data for that input, across all
of the batch runs.
• At least one column in each file has the same name as a column in the other file. That
column should contain the information needed to match the input data, in this case, the
Sample Reads input data with the relevant Reference Sequences input data for each batch
unit.
Figure 13.73: A workflow with 2 inputs, where the Batch checkbox had been checked for both in
the relevant launch steps. Metadata is used to define the batch units since the correct inputs must
be matched together for each run.
In the Workflow-level batching section of the launch wizard, the following are specified:
• The primary input. The input that determines the number of times the workflow should be
run.
• The column in the metadata for the primary input that specifies the group the data belongs
to. Each group makes up a single batch unit.
• The column in both metadata files that together will be used to ensure that the correct data
from each workflow input are included together in a given batch run. For example, a given
set of sample reads will be mapped to the correct reference sequence. A column with this
name must be present in each metadata file or table.
Figure 13.74: Two Excel files containing information about the data for each batch unit for the
workflow shown in figure 13.73. With the settings selected there, the number of batch runs will
be based on the Sample Reads input, and will equal the number of unique SRR_ID entries in the
DrosophilaMultiReference.xlsx file. The correct reference sequence to map to is determined by
matching information in the Reference column of each Excel file.
In figure 13.73, Sample Reads is the primary input: We wish to run the workflow once for
each sample, which here, is once for each SRR_ID entry. The Reference sequence to use for
each of these batch units is defined in a column called Reference, which is present in both
the file containing information about the samples and the file containing information about the
references.
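The matching of two inputs through a shared column can be sketched in Python. This is an illustration of the idea only. SRR_ID and Reference are the column names used in the example above, while the Name column, the file names, and the function name pair_inputs are hypothetical.

```python
def pair_inputs(sample_rows, reference_rows, group_column, match_column):
    """Build batch units from the primary input and attach matching references."""
    refs = {}
    for row in reference_rows:
        refs.setdefault(row[match_column], []).append(row["Name"])
    units = {}
    for row in sample_rows:  # the primary input determines the number of runs
        unit = units.setdefault(row[group_column], {"reads": [], "references": []})
        unit["reads"].append(row["Name"])
        for ref in refs.get(row[match_column], []):
            if ref not in unit["references"]:
                unit["references"].append(ref)
    return units


samples = [
    {"Name": "SRR001_1.fastq", "SRR_ID": "SRR001", "Reference": "chr2L"},
    {"Name": "SRR001_2.fastq", "SRR_ID": "SRR001", "Reference": "chr2L"},
    {"Name": "SRR002_1.fastq", "SRR_ID": "SRR002", "Reference": "chrX"},
]
references = [
    {"Name": "chr2L.fa", "Reference": "chr2L"},
    {"Name": "chrX.fa", "Reference": "chrX"},
]
units = pair_inputs(samples, references, "SRR_ID", "Reference")
```

Two batch units result, one per unique SRR_ID, and each carries only the reference whose Reference value matches that of its reads.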
Figure 13.75: The top-level Iterate element results in a subdivision (grouping) of the data, and the
innermost Iterate results in a further subdivision (grouping) of each of those groups.
When running the workflow, only metadata can be used to define the groups, because the
workflow contains multiple levels of iterations (figure 13.76).
Figure 13.76: When the workflow contains multiple levels of iterations, only metadata can be used
to define the groups.
It is always possible to execute a third level of batching by selecting the Batch checkbox when
launching the workflow: this will run the whole workflow, including the inner batching processes,
several times with different sets of data.
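Multiple levels of iteration can be pictured as nested grouping, where an outer metadata column divides the data and an inner column subdivides each group. The Python sketch below shows two levels. The column names Patient, Tissue and Name, the data, and the function name are hypothetical.

```python
def nested_batch_units(rows, outer_column, inner_column, name_column="Name"):
    """Group rows by an outer column, then subdivide each group by an inner column."""
    tree = {}
    for row in rows:
        inner = tree.setdefault(row[outer_column], {})
        inner.setdefault(row[inner_column], []).append(row[name_column])
    return tree


rows = [
    {"Name": "r1", "Patient": "P1", "Tissue": "tumor"},
    {"Name": "r2", "Patient": "P1", "Tissue": "normal"},
    {"Name": "r3", "Patient": "P2", "Tissue": "tumor"},
    {"Name": "r4", "Patient": "P2", "Tissue": "tumor"},
]
tree = nested_batch_units(rows, "Patient", "Tissue")
```

A further batching level, as selected with the Batch checkbox, would correspond to running this whole grouping once per set of input data.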
Control flow elements are described in more detail in section 13.2.5.
Figure 13.77: The Template Workflows folder in the Workflows tab of the Toolbox
• From under the Workflows tab in the Toolbox in the lower, left side of the Workbench:
Right-click on the workflow name and select the option Open Copy of Workflow from the
menu that appears.
or
• From the Workflow Manager: Open the Workflow Manager by clicking on the Manage
Workflows button ( ) in the toolbar, and choose the option Manage Workflows.
Click on the Template Workflows tab and then select the workflow of interest. Then click
on the Open Copy of Workflow button.
You can specify which settings can be adjusted when launching a workflow, and which cannot,
by unlocking or locking parameters in workflow elements. Unlocked parameters can be adjusted
when launching the workflow. For locked parameters, the value specified in the design is always
used when the workflow is run.
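The effect of locking can be sketched as a merge of design values and launch-time choices in which locked parameters always keep their design value. The Python below is an illustration only. The function name launch_settings and the parameter names are hypothetical.

```python
def launch_settings(design, locked, overrides):
    """Combine design values with launch-time values, honoring locked parameters."""
    settings = dict(design)
    for name, value in overrides.items():
        # Only unlocked parameters defined in the design can be adjusted at launch.
        if name in design and name not in locked:
            settings[name] = value
    return settings


design = {"quality_limit": 0.05, "minimum_length": 15}
settings = launch_settings(
    design,
    locked={"quality_limit"},
    overrides={"quality_limit": 0.01, "minimum_length": 30},
)
```

Here minimum_length takes the launch-time value, while quality_limit keeps the value from the workflow design because it is locked.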
Installed workflows cannot be edited directly, so by locking settings, and installing the workflow,
you create a customized version of a template workflow, validated for your purposes, where you
know exactly the settings that will be used for each workflow run.
Related documentation
The following manual pages contain information relevant to working with copies of template
workflows:
• Configuring workflow elements, including locking and unlocking parameters: section 13.2.2
• Tips for configuring the view of workflows when editing them: section 13.1.9
The template workflows distributed with the CLC Main Workbench are described after this section.
Template workflows distributed with plugins are described in the relevant plugin manual.
• Trim Sequences. Adds Trim annotations to sequences. Trimming options can be configured
when launching the workflow. A trimming report is generated. See section 21.2 for more
information.
Figure 13.78: The Trim and Map Sanger Sequences template workflow
Note: Copies of installed and template workflows can also be opened from under the Workflows
tab in the Toolbox at the bottom left side of the Workbench. Right-click on the workflow name
and choose "Open Copy of Workflow" from the menu that appears.
Figure 13.79: An installed workflow has been selected in the Workflow Manager. Some actions
can be carried out on this workflow, and a preview pane showing the workflow design is open on
the right hand side.
Configure
Clicking on the Configure button for an installed workflow will open a dialog where configurable
steps in the workflow are shown (figure 13.80). Settings can be configured, and unlocked settings
can be locked if desired.
Note: Parameters locked in the original workflow design cannot be unlocked. Those locked using
the Configure functionality of the Workflow Manager can be unlocked again later in the same
way, if desired.
Parameter locking is described further in section 13.2.2.
Note that parameters requiring the selection of data should only be locked if the workflow will
only be installed in a setting where there is access to the same data, in the same location, as
the system where the workflow was created, or if the parameter is optional and no data should
be selected. If the workflow is intended to be executed on a CLC Server, it is important to select
data that is located on the CLC Server.
Rename
Clicking on the Rename button for an installed workflow allows you to change the name. The
workflow will then be listed with that new name in the "Installed Workflows" folder in the
Workflows menu.
Description, Preview and Information
In the right hand pane of the Workflow Manager are three tabs.
• Description Contains the description that was entered when creating the workflow installer
(figure 13.81). See section 13.6.2.
• Information Contains general information about that workflow, such as the name, id,
author, etc. (figure 13.82, and described in detail below).
• Workflow build id The date (day month year) followed by the time (24-hour clock)
time) when the workflow installer was created. If the workflow was installed locally without
an installation file being explicitly created, the build ID will reflect the time of installation.
• Referenced data If reference data was referred to by the workflow and the option Bundled
or Reference was selected when the installer was made, the reference data referred to is
listed in this field. See section 13.6.2 for further details about these options.
• Author email The email address the workflow author entered when creating the workflow
installer.
Figure 13.81: The description provided when creating the workflow installer is available in the
Description tab in the Workflow Manager.
Figure 13.82: The Information tab contains the information provided when the workflow installer
was created as well as the workflow build-id.
• Author homepage The homepage the workflow author entered when creating the workflow
installer.
• Author organization The organization the workflow author entered when creating the
workflow installer.
• Author name The workflow author's name.
• Workflow version The version that the author assigned to the workflow when creating the
installer.
• Created using Workbench version The version of the CLC Workbench used when the
workflow installer was created.
• Updating installed workflows when using software in a higher major version line
"Major version line" refers to the first number in the version number. For example, versions
23.0.1 and 23.0.5 are part of the same major version line (23). Version 22.0 is part of a
different major version line (22).
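As a rough illustration of the definition above, the following Python sketch checks whether two version strings belong to the same major version line. The function names are hypothetical and are not part of any CLC software; the sketch simply assumes dotted version strings such as 23.0.1.

```python
def major_version_line(version: str) -> int:
    """Return the leading number of a dotted version string, e.g. 23 for '23.0.5'."""
    return int(version.split(".")[0])


def same_major_version_line(version_a: str, version_b: str) -> bool:
    """Return True when both versions share the same major version line."""
    return major_version_line(version_a) == major_version_line(version_b)


if __name__ == "__main__":
    # Versions taken from the example in the text above.
    print(same_major_version_line("23.0.1", "23.0.5"))
    print(same_major_version_line("23.0.1", "22.0"))
```

With the versions used in the text, the first comparison reports the same major version line and the second reports a different one.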
Figure 13.83: The workflow update editor lists tools and parameters that will be updated.
To update the workflow, click on the OK button at the bottom of the editor.
The updated workflow can be saved under a new name, leaving the original workflow unaffected.
Updating installed and template workflows when using an upgraded Workbench in the same
major version line
When working on an upgraded CLC Workbench in the same major version line, installed and
template workflows are updated using the Workflow Manager.
To start the Workflow Manager, go to:
Utilities | Manage Workflows
or click on the "Workflows" button in the toolbar, and select "Manage Workflow..."
from the menu that appears.
A red message is displayed for each workflow that needs to be updated. An individual workflow
can be updated by selecting it and then clicking on the Update... button. Alternatively, click on
the Update All Workflows button to carry out all updates in a single action (figure 13.84).
Figure 13.84: A message in red text indicates a workflow needs to be updated. The Update
button can be used to update an individual workflow. Alternatively, update all workflows that need
updating by clicking on the Update All Workflows button.
When you update a workflow through the Workflow Manager, the old version is overwritten.
To update a workflow you must have permission to write to the area the workflow is stored in.
Usually, you will not need special permissions to do this for workflows you installed. However,
to update template workflows, distributed via plugins, the CLC Workbench will usually need to be
run as an administrative user.
When one or more installed workflows or template workflows need to be updated, you are
informed when you start up the CLC Workbench. A dialog listing these workflows is presented,
prompting you to open the Workflow Manager (figure 13.85).
Updating installed workflows when using software in a higher major version line
To update an installed workflow after upgrading to software in a higher major version line, you
need a copy of the older Workbench version, which the installed workflow can be run on, as well
as the latest version of the Workbench.
To start, open a copy of the installed workflow in a version of the Workbench it can be run
on. To do this, right-click on the workflow's name in the Installed Workflows folder under the
Workflows tab in the Toolbox panel in the bottom left side of the Workbench, and choose the
option "Open Copy of Workflow" from the menu that appears (figure 13.86).
Save the copy of the workflow. One way to do this is to drag and drop the tab to the location of
your choice in the Navigation Area.
Close the older Workbench and open the new Workbench version. In the new version, open the
workflow you just saved. Click on the OK button if you are prompted to update the workflow.
After checking that the workflow has been updated correctly, including that any reference data is
configured as expected, save the updated workflow. Finally, click the Installation button to install
the workflow, if desired.
If the above process does not work when upgrading directly from a much older Workbench version,
it may be necessary to upgrade step-wise by upgrading the workflow in sequentially higher major
versions of the Workbench.
Figure 13.85: A dialog reporting that an installed workflow needs to be updated to be used on this
version of the Workbench.
Figure 13.86: Open a copy of an installed workflow by right-clicking on its name in the Workflows
tab in the Toolbox and choosing the "Open Copy of Workflow" option from the menu.
compatible CLC Workbench or CLC Server. If you are logged into a CLC Server as a user with
appropriate permissions, you will also have the option to install the workflow directly on the CLC
Server.
Organization (Required) The organization name. This is used as part of the workflow id
(section 13.6).
Workflow name (Required) The name of the workflow, as it should appear under the Workflows
menu after installation. Changing this does not affect the name of the original workflow
(as appears in your Navigation Area). This name is also used as part of the workflow id
(section 13.6).
ID The workflow id. This is created using information provided in other fields. It cannot be directly
edited.
Workflow icon An icon to use for this workflow in the Workflows menu once the workflow is
installed. Icons use a 16 x 16 pixel gif or png file. If your icon file is larger, it will
automatically be resized to fit 16 x 16 pixels.
Workflow version A major and minor version for this workflow. This version will be visible via
the Workflow Manager after the workflow is installed, and will be included in the history of
elements generated using this workflow. The version of the workflow open in the Workflow
Editor, from which this installer is being created, will also be updated to the version
specified here.
Workflow description A textual description of the workflow. After installation, this is shown in
the Description tab of the Workflow Manager (section 13.6) and is also shown in a tooltip
when the cursor is hovered over the installed workflow in the Workflows menu in the Toolbox
panel at the bottom left of the Workbench.
Figure 13.87: Provide information about the workflow that you are creating an installer for.
Ignore The data elements selected as inputs in the original workflow are not included in the
workflow installer.
Input options where Ignore is selected should generally be kept unlocked. If locked, the
data element referred to must be present in the exact relative location used on your system
when creating the installer. If the option is locked, and the selected data element is not
present in the expected location, an error message is shown in the Workflow Manager when
the workflow is installed. It will not be possible to run that workflow until the relevant data
element is present in the expected location.
Bundle The data elements selected as inputs in the original workflow are included in the workflow
installer. This is a good choice when sharing the workflow with others who may not have
the relevant reference data on their system.
When installing a workflow with bundled data on a CLC Workbench, you are prompted where
to save the bundled data elements. If the workflow is on a CLC Server, the data elements
are saved automatically, as described in the CLC Server manual at:
http://resources.qiagenbioinformatics.com/manuals/clcserver/current/admin/index.php?manual=Installing_configuring_workflows.html
Bundling is intended for use with small reference data elements. With larger elements,
the workflow installer can quickly become very big. In addition, each time such a workflow
is installed, a copy of the bundled data is included, even if the relevant data element is
When working with large data elements, leaving the input option unlocked and choosing the
Ignore option is recommended. In this case, the relevant data elements should be shared using
other means. For example, export the data and share this separately. The benefit of this approach
over bundling is that the data can be shared once, rather than with every workflow installer that
refers to it.
Installation options
The final step asks you to indicate whether to install the workflow directly to your Workbench
or to create an installer file, which can be used to install the workflow on any compatible CLC
Main Workbench or CLC Server (figure 13.89). If you are logged into a CLC Server as a user with
appropriate permissions, you will also have the option to install the workflow directly on the CLC
Server.
Figure 13.89: Select whether the workflow should be installed to your CLC Workbench or an
installer file (.cpw) should be created. Installation to the CLC Server is only available if you are
logged in as a user with permission to administer workflows on the CLC Server.
Workflows installed on a CLC Workbench cannot be updated in place. To update such a workflow,
make a copy, modify it, and then create a new installer. We recommend that you increase the
version number when creating the installer to help track your changes.
When you then install the updated copy of the workflow, a dialog will pop up with the message
"Workflow is already installed" (figure 13.90). You have the option to force the installation. This
will uninstall the existing workflow and install the modified version of the workflow. If you choose
to do this, the configuration of the original workflow will be gone.
Figure 13.90: Select whether you wish to force the installation of the workflow or keep the original
workflow.
Figure 13.91: Workflows available in the workflow manager. The alert on the "Variant detection"
workflow means that this workflow needs to be updated.
See section 13.6.2 for information about options for handling reference data inputs.
Information about installing workflows on a CLC Server is provided in the CLC Server manual at:
http://resources.qiagenbioinformatics.com/manuals/clcserver/current/admin/index.php?manual=Installing_configuring_workflows.html
Part III
Bioinformatics
Chapter 14
Contents
14.1 Sequence Lists
14.1.1 Creating sequence lists
14.1.2 Graphical view of sequence lists
14.1.3 Table view of sequence lists
14.1.4 Annotation Table view of sequence lists
14.1.5 Working with paired sequences in lists
14.2 View sequences
14.2.1 Sequence settings in Side Panel
14.2.2 Selecting parts of the sequence
14.2.3 Editing the sequence
14.2.4 Sequence region types
14.2.5 Circular DNA
14.3 Working with annotations
14.3.1 Viewing annotations
14.3.2 Adding annotations
14.3.3 Editing annotations
14.3.4 Export annotations to a gff3 format file
14.3.5 Removing annotations
14.4 Element information
14.5 View as text
Sequence information is stored in sequence elements or sequence lists. This chapter describes
basic functionality for creating and working with these types of elements. Most functionality
available for sequence elements is also available for sequence lists.
When you import multiple sequences, they are generally put into a sequence list, and this is the
element type in use for most types of work. See chapter 7.
CHAPTER 14. VIEWING AND EDITING SEQUENCES 271
Figure 14.1: Two views of a sequence list are open in linked views, graphical at the top, and tabular
at the bottom. Each view can be customized individually using settings in its side panel on the right.
• When sequences are downloaded, for example using tools under the Download menu.
• Extract Sequences Extracts all sequences from the sequence list. If your aim is to extract
a subset of the sequences, this can be done from the Table view (see section 14.1.3)
or using Split Sequence List (see section 27.7).
• Add Sequences Add sequences from sequence elements or sequence lists to this sequence
list.
• Delete All Annotations from All Sequences Deleting all annotations on all sequences can
be done with this option for sequence lists with 1000 or fewer sequences. In other cases,
or for more control over the annotations to delete, use the Annotation Table view,
described further below.
• Sort the sequence list alphabetically by sequence name, by length or by marked status.
These options are only available for sequence lists with 1000 or fewer sequences.
• Delete sequences that have been marked This option is enabled when at least one
sequence has been marked. Marking sequences is described below.
Tips for working with larger sequence lists are given later in this section.
Marking sequences:
Sequences in a list can be marked. Once marked, those sequences can be deleted, or the
sequence list can be sorted based on whether sequences are marked or not. It is easy to
adjust markings on many sequences using the options in the right-click menu on selection boxes
(figure 14.4).
To mark sequences:
1. Check the Show selection boxes option in the "Sequence List Settings" section of the side
panel settings on the right hand side.
This makes checkboxes visible to the right of each sequence name.
Figure 14.2: Options to extract the sequences in the list, add sequences to the list, and to delete
all annotations on all sequences are available when you right-click on a blank area of the graphical
view of a sequence list.
• Sorting long lists can be done in Table view. For example, to sort on length, ensure
the Size column is enabled in the side panel to the right of the table, and then click on
the Size column header to sort the list. If a Table view is open as a linked view with the
graphical view, clicking on a row in the table will highlight that sequence in the graphical
view. See section 2.1 for information on linked views and section 9 for information about
working with tables.
• Deleting annotations on sequences can be done in the Annotation Table view. Right
click anywhere in this view to reveal a menu containing relevant options. To delete all
annotations on the sequence list, ensure all annotation types are enabled in the side panel
settings to the right.
• To delete many sequences from a list, you can mark the few you wish to retain, and then
invert the marking by right-clicking on any selection checkbox and choosing the option Invert
All Marks (figure 14.4).
Then right-click on any sequence or sequence name and choose the option to Delete
Marked Sequences (figure 14.3). If the sequence list contains more than 1000 sequences,
a warning will appear noting that, if you proceed, the deletion cannot be undone.
Figure 14.3: Options to rename, select, open, or delete a sequence are available when you
right-click on the name or residues for a given sequence. Also in this menu are options for sorting
the list and deleting marked sequences.
Figure 14.4: Which sequences are marked can be quickly adjusted using the options in the
right-click menu for any selection checkbox. The Show Selection boxes option in the side panel
must be enabled to see these boxes.
• Renaming multiple sequences in a list following the same renaming pattern can be done
using the dedicated tool, Rename Sequences in Lists, described in section 27.9.
Figure 14.5: In Table view there is a row for each sequence in the sequence list. The number of
rows equates to the number of sequences and is reported at the top left side. Right-click to display
a menu with actions. This menu differs slightly depending on which column you click upon.
• Add sequences Add sequences to this list by dragging and dropping sequence elements or
sequence lists from the Navigation Area into the table. Sequences can also be added from
the graphical view using a right-click option, as described earlier in this section.
• Copy sequence names Select the relevant rows, right-click and choose Copy Sequence
Names from the menu. This list can be used within the Workbench, for example, in table
filters with the action "is in list" or "is not in list" to find these names in other elements, or
they can be pasted to other programs that accept text lists, such as Excel or text editors.
• Edit attributes Right-click in the cell you wish to edit, and then update the contents of
that cell. For example, if you right-click on a cell in the Name column, an option called "Edit
Name..." will be in the menu presented (figure 14.5).
If you select multiple rows, you will be able to edit the attribute, with the value you provide
being applied to all the selected rows.
Values calculated from the sequence itself cannot be edited directly. For example, the Size
column contains the length of each sequence, and the Start of sequence column contains the
first 50 residues.
To a new sequence list by selecting relevant rows and clicking on the Create New
Sequence List button. This new list must be saved if you wish to keep it.
To individual sequence elements by selecting relevant rows and dragging them into
the Navigation Area.
Adding attributes
Attributes (columns in Table view) can be added using the right-click menu option Add Attributes.
This is good for small lists and simple changes. You are prompted for an attribute name and a
single value. A new column is added to the table with the name you provide, and the value you
provided is added for all of the selected rows. This option can also be used to edit contents of
an existing column, if desired.
The Update Sequence Attributes in Lists tool supports more detailed work, including importing
from external sources, such as Excel and CSV format files. See section 27.6 for more details.
Figure 14.6: A warning appears when trying to create a new sequence list from a mixture of paired
and unpaired sequence lists.
Figure 14.7: Overview of the Side Panel for a sequence. Each tab can be expanded to reveal
settings that can be configured.
Sequence Layout
These preferences determine the overall layout of the sequence:
• Double stranded. Shows both strands of a sequence (only applies to DNA sequences).
• Numbers on sequences. Shows residue positions along the sequence. The starting point
can be changed by setting the number in the field below. If you set it to e.g. 101, the first
residue will have the position of -100. This can also be done by right-clicking an annotation
and choosing Set Numbers Relative to This Annotation.
• Numbers on plus strand. Whether to set the numbers relative to the positive or the negative
strand in a nucleotide sequence (only applies to DNA sequences).
• Lock numbers. When you scroll vertically, the position numbers remain visible. (Only
possible when the sequence is not wrapped.)
• Lock labels. When you scroll horizontally, the label of the sequence remains visible.
Restriction sites
Please see section 23.1.1.
Motifs
See section 18.9.1.
Residue coloring
These preferences make it possible to color both the residue letter and set a background color
for the residue.
• Non-standard residues. For nucleotide sequences this will color the residues that are not
C, G, A, T or U. For amino acids only B, Z, and X are colored as non-standard residues.
Foreground color. Sets the color of the letter. Click the color box to change the color.
Background color. Sets the background color of the residues. Click the color box to
change the color.
• Rasmol colors. Colors the residues according to the Rasmol color scheme.
See http://www.openrasmol.org/doc/rasmol.html
Foreground color. Sets the color of the letter. Click the color box to change the color.
Background color. Sets the background color of the residues. Click the color box to
change the color.
• Polarity colors (only protein). Colors the residues according to the following categories:
• Trace colors (only DNA). Colors the residues according to the color conventions of
chromatogram traces: A=green, C=blue, G=black, and T=red.
Nucleotide info
These preferences apply only to nucleotide sequences.
The data points for graph representations can be exported (see section 8.3).
• Translation. Displays a translation into protein just below the nucleotide sequence.
Depending on the zoom level, the amino acids are displayed with three letters or one letter.
In cases where variants are present in the reads, synonymous variants are shown in orange
in the translated sequence whereas non-synonymous are shown in red.
∗ Selection. This option will only take effect when you make a selection on the
sequence. The translation will start from the first nucleotide selected. Making a
new selection will automatically display the corresponding translation. Read more
about selecting in section 14.2.2.
∗ +1 to -1. Select one of the six reading frames.
∗ All forward/All reverse. Shows either all forward or all reverse reading frames.
∗ All. Select all reading frames at once. The translations will be displayed on top of
each other.
Table. The translation table to use in the translation. For more about translation
tables, see section 19.4.
Only AUG start codons. For most genetic codes, a number of codons can be start
codons (TTG, CTG, or ATG). These are colored green unless the "Only AUG start codons"
option is selected, in which case only AUG codons are colored green.
Single letter codes. Choose to represent the amino acids with a single letter instead
of three letters.
• Quality scores. For sequencing data containing quality scores, the quality score information
can be displayed along the sequence.
Show as probabilities. Converts quality scores to error probabilities on a 0-1 scale,
i.e. not log-transformed.
Foreground color. Colors the letter using a gradient, where the left side color is used
for low quality and the right side color is used for high quality. The sliders just above
the gradient color box can be dragged to highlight relevant levels. The colors can be
changed by clicking the box. This will show a list of gradients to choose from.
Background color. Sets a background color of the residues using a gradient in the
same way as described above.
Graph. The quality scores are displayed as a graph.
∗ Height. Specifies the height of the graph.
∗ Type. The graph can be displayed as Line plot, Bar plot or as a Color bar.
∗ Color box. For Line and Bar plots, the color of the plot can be set by clicking
the color box. For a Color bar, the color box is replaced by a gradient color box as
described under Foreground color.
• Trace data. See section 21.1.
• G/C content. Calculates the G/C content of a part of the sequence and shows it as a
gradient of colors or as a graph below the sequence.
Window length. Determines the length of the part of the sequence to calculate. A
window length of 9 will calculate the G/C content for the nucleotide in question plus
the 4 nucleotides to the left and the 4 nucleotides to the right. A narrow window will
focus on small fluctuations in the G/C content level, whereas a wider window will show
fluctuations between larger parts of the sequence.
Foreground color. Colors the letter using a gradient, where the left side color is used
for low levels of G/C content and the right side color is used for high levels of G/C
content. The sliders just above the gradient color box can be dragged to highlight
relevant levels of G/C content. The colors can be changed by clicking the box. This
will show a list of gradients to choose from.
Background color. Sets a background color of the residues using a gradient in the
same way as described above.
Graph. The G/C content levels are displayed as a graph.
∗ Height. Specifies the height of the graph.
∗ Type. The graph can be displayed as Line plot, Bar plot or as a Color bar.
∗ Color box. For Line and Bar plots, the color of the plot can be set by clicking
the color box. For a Color bar, the color box is replaced by a gradient color box as
described under Foreground color.
When zoomed out, the graph displays G/C content only for a subset of evenly spaced
positions. Because insertions shift reference positions, zoomed-out graphs with and
without insertions may not be directly comparable, as G/C content may be displayed
for different positions.
• Secondary structure. Allows you to choose how to display a symbolic representation of the
secondary structure along the sequence.
See section 24.2.3 for a detailed description of the settings.
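Two of the per-position tracks described above can be sketched in a few lines of Python. This is an illustration only, not the Workbench's implementation. It assumes that quality scores are standard Phred scores, so that the error probability is 10^(-Q/10), and that the G/C window is truncated at the ends of the sequence, which the text above does not specify. The function names are hypothetical.

```python
def phred_to_error_probability(quality: int) -> float:
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10).

    Assumption: the quality scores are standard Phred scores.
    """
    return 10 ** (-quality / 10)


def gc_content_track(sequence: str, window_length: int = 9) -> list[float]:
    """Return the G/C fraction for a window centered on each nucleotide.

    A window length of 9 covers the position itself plus the 4 nucleotides
    to the left and the 4 nucleotides to the right.
    Assumption: windows are truncated at the sequence ends, and only odd
    window lengths are accepted so that the window can be centered.
    """
    if window_length < 1 or window_length % 2 == 0:
        raise ValueError("window_length must be a positive odd number")
    half = window_length // 2
    seq = sequence.upper()
    track = []
    for i in range(len(seq)):
        window = seq[max(0, i - half): i + half + 1]
        gc_count = sum(1 for base in window if base in "GC")
        track.append(gc_count / len(window))
    return track


if __name__ == "__main__":
    print(phred_to_error_probability(20))
    print(gc_content_track("GGGGAAAAA", window_length=9))
```

A narrow window makes the track follow local fluctuations, while a wider window smooths them out, matching the description of the Window length setting.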
Protein info
These preferences only apply to proteins. The first nine items are different hydrophobicity scales.
These are described in section 20.3.1.
• Kyte-Doolittle. The Kyte-Doolittle scale is widely used for detecting hydrophobic regions
in proteins. Regions with a positive value are hydrophobic. This scale can be used for
identifying both surface-exposed regions as well as transmembrane regions, depending
on the window size used. Short window sizes of 5-7 generally work well for predicting
putative surface-exposed regions. Large window sizes of 19-21 are well suited for finding
transmembrane domains if the values calculated are above 1.6 [Kyte and Doolittle, 1982].
These values should be used as a rule of thumb and deviations from the rule may occur.
• Engelman. The Engelman hydrophobicity scale, also known as the GES-scale, is another
scale which can be used for prediction of protein hydrophobicity [Engelman et al., 1986].
Like the Kyte-Doolittle scale, this scale is useful for predicting transmembrane regions in
proteins.
• Rose. The hydrophobicity scale by Rose et al. is correlated to the average area of buried
amino acids in globular proteins [Rose et al., 1985]. The resulting scale reflects surface
accessibility rather than the helices of a protein.
• Janin. This scale also provides information about the accessible and buried amino acid
residues of globular proteins [Janin, 1979].
• Hopp-Woods. Hopp and Woods developed their hydrophobicity scale for identification of
potentially antigenic sites in proteins. This scale is basically a hydrophilic index where
apolar residues have been assigned negative values. Antigenic sites are likely to be
predicted when using a window size of 7 [Hopp and Woods, 1983].
• Welling. [Welling et al., 1985] Welling et al. used information on the relative occurrence of
amino acids in antigenic regions to make a scale which is useful for prediction of antigenic
regions. This method is better than the Hopp-Woods scale of hydrophobicity which is also
used to identify antigenic regions.
• Surface Probability. Display of surface probability based on the algorithm by [Emini et al.,
1985]. This algorithm has been used to identify antigenic determinants on the surface of
proteins.
• Chain Flexibility. Display of backbone chain flexibility based on the algorithm by [Karplus
and Schulz, 1985]. It is known that chain flexibility is an indication of a putative antigenic
determinant.
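The windowed scales above share one pattern: a per-residue value is averaged over a sliding window. A minimal Python sketch for the Kyte-Doolittle scale follows. The per-residue values are those published by Kyte and Doolittle (1982). The function names and the handling of sequence ends (positions without a full window are skipped) are assumptions made for illustration, and the Workbench may compute its graph differently.

```python
# Kyte-Doolittle hydropathy values per amino acid (single-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> list[tuple[int, float]]:
    """Return (center position, mean hydropathy) for every full window.

    Positions are 1-based. Assumption: positions too close to either end
    to have a full window are skipped. Residues outside the 20 standard
    amino acids raise a KeyError.
    """
    seq = protein.upper()
    half = window // 2
    profile = []
    for start in range(len(seq) - window + 1):
        values = [KYTE_DOOLITTLE[residue] for residue in seq[start:start + window]]
        profile.append((start + half + 1, sum(values) / window))
    return profile


def candidate_transmembrane_positions(
    protein: str, window: int = 19, threshold: float = 1.6
) -> list[int]:
    """Return window centers whose mean hydropathy exceeds the threshold."""
    return [pos for pos, score in hydropathy_profile(protein, window) if score > threshold]


if __name__ == "__main__":
    # Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
    example = "DDDDD" + "L" * 21 + "KKKKK"
    print(candidate_transmembrane_positions(example))
```

The defaults follow the rule of thumb in the text: a window of 19 and a threshold of 1.6 for transmembrane domains. A window of 5 to 7 would be the analogous choice for surface-exposed regions.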
Find
The Find function can be used for searching the sequence and is invoked by pressing Ctrl +
Shift + F (⌘ + Shift + F on Mac). First, specify the search term to be found, select the type
of search (see the options described below) and then click on the Find button. The first
occurrence of the search term will then be highlighted. Clicking the Find button again will find
the next occurrence, and so on. If the search string is found, the corresponding part of the
sequence will be selected.
• Search term. Enter the text or number to search for. The search function does not
discriminate between lower and upper case characters.
• Sequence search. Search the nucleotides or amino acids. For amino acids, the single
letter abbreviations should be used for searching. The sequence search also has a set of
advanced search parameters:
Include negative strand. This will search on the negative strand as well.
Treat ambiguous characters as wildcards in search term. If you search for e.g. ATN,
you will find both ATG and ATC. If you wish to find literally exact matches for ATN (i.e.
only find ATN - not ATG), this option should not be selected.
Treat ambiguous characters as wildcards in sequence. If you search for e.g. ATG, you
will find both ATG and ATN. If you have large regions of Ns, this option should not be
selected.
Note that if you enter a position instead of a sequence, it will automatically switch to
position search.
• Annotation search. Search the annotations on the sequence. The search is performed on both
the labels of the annotations and the text appearing in the tooltip shown when you hover the
mouse cursor over an annotation. If the search term is found, the part of the sequence
corresponding to the matching annotation is selected. The option "Include translations"
lets you search translations that are part of an annotation (in some cases, CDS annotations
contain the amino acid sequence in a "/translation" field). It does not dynamically translate
nucleotide sequences, nor does it search the translations that can be enabled using the
"Nucleotide info" side panel.
• Position search. Find a specific position on the sequence. To find an interval, e.g.
from position 500 to 570, enter "500..570" in the search field. This will make a selection
from position 500 to 570 (both included). Note the two periods (..) between the start
and end numbers. If you enter a position that includes thousands separators, like 123,345,
the comma is ignored, so this is equivalent to entering 123345.
• Include negative strand. When searching the sequence for nucleotides or amino acids, you
can search on both strands.
• Name search. Search for sequence names. This is useful for searching sequence lists and
mapping results for example.
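The position and wildcard rules described above can be sketched in Python as follows. This only illustrates the documented behaviour and is not the Workbench's own code. The function names are hypothetical, the ambiguity table is the standard IUPAC nucleotide code for DNA, and searching the negative strand is left out.

```python
# Standard IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def parse_position_query(text: str) -> tuple[int, int]:
    """Parse '500..570' or '123,345' into an inclusive (start, end) pair.

    Thousands separators (commas) are ignored, as described in the text.
    """
    cleaned = text.replace(",", "").strip()
    start, separator, end = cleaned.partition("..")
    if separator:
        return int(start), int(end)
    return int(start), int(start)


def residues_match(term_char: str, seq_char: str,
                   wild_in_term: bool, wild_in_sequence: bool) -> bool:
    """Compare one search-term character with one sequence character."""
    if term_char == seq_char:
        return True
    if wild_in_term and seq_char in IUPAC.get(term_char, ""):
        return True
    if wild_in_sequence and term_char in IUPAC.get(seq_char, ""):
        return True
    return False


def find_all(sequence: str, term: str, wild_in_term: bool = False,
             wild_in_sequence: bool = False) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (start, end) positions of every match.

    The search is case-insensitive, as described for the search term.
    """
    seq, query = sequence.upper(), term.upper()
    hits = []
    if not query:
        return hits
    for i in range(len(seq) - len(query) + 1):
        window = seq[i:i + len(query)]
        if all(residues_match(q, s, wild_in_term, wild_in_sequence)
               for q, s in zip(query, window)):
            hits.append((i + 1, i + len(query)))
    return hits


if __name__ == "__main__":
    print(parse_position_query("500..570"))
    print(find_all("CCATGCCATC", "ATN", wild_in_term=True))
    print(find_all("CCATNC", "ATG", wild_in_sequence=True))
```

Keeping the two wildcard options as separate flags mirrors the dialog: searching for ATN with wildcards in the search term finds ATG and ATC, while searching for ATG with wildcards in the sequence also matches a stretch such as ATN.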
This concludes the description of the View Preferences. Next, the options for selecting and
editing sequences are described.
Text format
These preferences allow you to adjust the format of all the text in the view (residue letters,
sequence name and translations, if they are shown).
• Text size. Specify a font size for the text in the view.
• Font. Specify a font for the text in the view.
• Bold. Make the text for the residues bold.
or press and hold the Shift key while using the right and left arrow keys to adjust the
right side of the selection.
If you wish to select the entire sequence:
double-click the sequence name to the left
Open a selection in a new view A selection can be opened in a new view and saved as a new
sequence:
right-click the selection | Open selection in New View
This opens the selected part of the sequence in a new view. The new sequence can be saved
by dragging the tab of the sequence view into the Navigation Area.
The process described above is also the way to manually translate coding parts of sequences
(CDS) into protein. You simply translate the new sequence into protein. This is done by:
right-click the tab of the new sequence | Tools | Nucleotide Analysis | Translate
to Protein
A selection can also be copied to the clipboard and pasted into another program:
make a selection | Ctrl + C (⌘ + C on Mac)
Note! The annotations covering the selection will not be copied.
A selection of a sequence can be edited as described in the following section.
Figure 14.8: Three regions on a human beta globin DNA sequence (HUMHBB).
Figure 14.9 shows an artificial sequence with all the different kinds of regions.
• Similarities:
Figure 14.9: Region #1: A single residue, Region #2: A range of residues including both endpoints,
Region #3: A range of residues starting somewhere before 30 and continuing up to and including
40, Region #4: A single residue somewhere between 50 and 60 inclusive, Region #5: A range of
residues beginning somewhere between 70 and 80 inclusive and ending at 90 inclusive, Region #6:
A range of residues beginning somewhere between 100 and 110 inclusive and ending somewhere
between 120 and 130 inclusive, Region #7: A site between residues 140 and 141, Region #8:
A site between two residues somewhere between 150 and 160 inclusive, Region #9: A region
that covers ranges from 170 to 180 inclusive and 190 to 200 inclusive, Region #10: A region on
negative strand that covers ranges from 210 to 220 inclusive, Region #11: A region on negative
strand that covers ranges from 230 to 240 inclusive and 250 to 260 inclusive.
• Differences:
In the Sequence Layout preferences, only the following options are available in the
circular view: Numbers on plus strand, Numbers on sequence and Sequence label.
You cannot zoom in to see the residues in the circular molecule. If you wish to see
these details, split the view with a linear view of the sequence.
In the Annotation Layout, you also have the option of showing the labels as Stacked.
This means that there are no overlapping labels and that all labels of both annotations
and restriction sites are adjusted along the left and right edges of the view.
To see the nucleotides of a circular molecule you can open a new view displaying a linear view
of the sequence:
Press and hold the Ctrl button (Cmd on Mac) | click Show Sequence at the
bottom of the view
This will open a linear view of the sequence below the circular view. When you zoom in on the
linear view you can see the residues as shown in figure 14.11.
Figure 14.11: Two views showing the same sequence. The bottom view is zoomed in.
Note! If you make a selection in one of the views, the other view will also make the corresponding
selection, providing an easy way for you to focus on the same region in both views.
Figure 14.12: Double angle brackets mark the start and end of a circular sequence in linear view
(top). The first line in the text view (bottom) contains information that the sequence is circular.
Figure 14.13: Right-click on a circular sequence to move the starting point to the selected position.
CHAPTER 14. VIEWING AND EDITING SEQUENCES 289
• In some of the data formats that can be imported into CLC Main Workbench, sequences
can have annotations (GenBank, EMBL and Swiss-Prot format).
• The results of a number of analyses in CLC Main Workbench are annotations on the sequence
(e.g. finding open reading frames and restriction map analysis).
• A protein structure can be linked with a sequence (section 15.4.2), and atom groups defined
on the structure transferred to sequence annotations or vice versa (section 15.4.3).
• You can manually add annotations to a sequence (described in section 14.3.2).
If you would like to extract parts of a sequence (or several sequences) based on its annotations,
you can find a description of how to do this in section 27.1.
Note! Annotations are included if you export the sequence in GenBank, Swiss-Prot, EMBL or CLC
format. When exporting in other formats, annotations are not preserved in the exported file.
• As graphical arrows or boxes in all views displaying sequences (sequence lists, alignments
etc.)
• Annotation Layout
• Annotation Types
Figure 14.15: The annotation layout in the Side Panel. The annotation types can be shown by
clicking on the "Annotation types" tab.
• Position.
On sequence. The annotations are placed on the sequence. The residues are visible
through the annotations (if you have zoomed in to 100%).
Next to sequence. The annotations are placed above the sequence.
Separate layer. The annotations are placed above the sequence and above restriction
sites (only applicable for nucleotide sequences).
• Offset. If several annotations cover the same part of a sequence, they can be spread out.
Piled. The annotations are piled on top of each other. Only the one at front is visible.
Little offset. The annotations are piled on top of each other, but they have been offset
a little.
More offset. Same as above, but with more spreading.
Most offset. The annotations are placed above each other with a little space between.
This can take up a lot of space on the screen.
• Label. The name of the annotation can be shown as a label. Additional information about the
sequence is shown if you place the mouse cursor on the annotation and keep it still.
• Show arrows. Displays the end of the annotation as an arrow. This can be useful to see
the orientation of the annotation (for DNA sequences). Annotations on the negative strand
will have an arrow pointing to the left.
In the Annotation types group, you can choose which kinds of annotations should be
displayed. This group lists all the types of annotations that are attached to the sequence(s) in the
view. For sequences with many annotations, it can be easier to get an overview if you deselect
the annotation types that are not relevant.
Unchecking the checkboxes in the Annotation layout will not remove annotations of this type
from the sequence - it will just hide them from the view.
Besides selecting which types of annotations should be displayed, the Annotation types
group is also used to change the color of the annotations on the sequence. Click the colored
square next to the relevant annotation type to change the color.
This will display a dialog with five tabs: Swatches, HSB, HSI, RGB, and CMYK. They represent
five different ways of specifying colors. Apply your settings and click OK. Note that the
Reset function only works for changes made before clicking OK; once you click OK, the
color settings cannot be reset.
Furthermore, the Annotation types can be used to easily browse the annotations by clicking the
small button next to the type. This will display a list of the annotations of that type (see
figure 14.16).
Clicking an annotation in the list will select this region on the sequence. In this way, you can
quickly find a specific annotation on a long sequence.
Note: A waved end on an annotation (figure 14.17) means that the annotation is torn, i.e.,
it extends beyond the sequence displayed. An annotation can be torn when a new, smaller
sequence has been created from a larger sequence. A common example of this situation is when
you select a section of a stand-alone sequence and open it in a new view. If there are annotations
present within this selected region that extend beyond the selection, then the selected sequence
shown in the new view will exhibit these torn annotations.
• Name.
• Type.
• Region.
• Qualifiers.
This information corresponds to the information in the dialog when you edit and add annotations
(see section 14.3.2).
The Name, Type and Region for each annotation can be edited simply by double-clicking, typing
the change directly, and pressing Enter. See section 14.3.3 for further information about editing
annotations.
• Name. The name of the annotation which can be shown on the label in the sequence views.
(Whether the name is actually shown depends on the Annotation Layout preferences, see
section 14.3.1).
• Type. Reflects the left-hand part of the dialog as described above. You can also choose
directly in this list or type your own annotation type. Note that your own annotation types
will be converted to "unsure" when exporting in GenBank format. As long as you use the
sequence in CLC format, your own annotation type will be preserved.
• Region. If you have already made a selection, this field will show the positions of the
selection. You can modify the region further using the conventions of DDBJ, EMBL and
GenBank. The following are examples of how to use the syntax (based on
https://www.insdc.org/submitting-standards/feature-table/):
• Annotations. In this field, you can add more information about the annotation like comments
and links. Click the Add qualifier/key button to enter information. Select a qualifier which
describes the kind of information you wish to add. If an appropriate qualifier is not present
in the list, you can type your own qualifier. The pre-defined qualifiers are derived from
the GenBank format. You can add as many qualifier/key lines as you wish by clicking the
button. Redundant lines can be removed by clicking the delete icon. The information
entered on these lines is shown in the annotation table (see section 14.3.1) and in the
yellow box which appears when you place the mouse cursor on the annotation. If you write
a hyperlink in the Key text field, e.g. "digitalinsights.qiagen.com", it will be recognized
as a hyperlink. Clicking the link in the annotation table will open a web browser.
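To illustrate the Region conventions mentioned above, the following minimal Python sketch reads a few common location strings in the INSDC feature-table style, such as 467, 340..565, <345..500, join(170..180,190..200) and complement(join(230..240,250..260)). It is a simplified illustration and not the parser used by the Workbench: the function name is hypothetical, the fuzzy-end markers < and > are accepted but discarded, and other forms (for example sites written with ^) are rejected.

```python
import re


def parse_location(text):
    """Return (strand, [(start, end), ...]) for a simple location string.

    Positions are 1-based and inclusive, as in the feature-table syntax.
    """
    text = text.replace(" ", "")
    strand = "+"
    match = re.fullmatch(r"complement\((.*)\)", text)
    if match:
        strand, text = "-", match.group(1)
    match = re.fullmatch(r"join\((.*)\)", text)
    parts = match.group(1).split(",") if match else [text]
    ranges = []
    for part in parts:
        m = re.fullmatch(r"[<>]?(\d+)(?:\.\.[<>]?(\d+))?", part)
        if m is None:
            raise ValueError(f"unsupported location: {part!r}")
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        ranges.append((start, end))
    return strand, ranges


print(parse_location("complement(join(230..240,250..260))"))
```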
Figure 14.20: The right-click menu in the Annotation Table view contains options for adding, editing,
exporting and deleting annotations.
• Edit Annotation... This option is only enabled if a single annotation is selected in the table.
It will open the same dialog used to edit annotations from the sequence view (figure 14.19).
• Advanced Rename... Choose this to rename the selected annotations using qualifiers or
annotation types. The options in the Rename dialog (figure 14.21) are:
Use this qualifier Choose the qualifier to use as the annotation name from a drop-down
list of qualifiers available in the selected annotations. Selected annotations that
do not include the selected qualifier will not be renamed. If an annotation has multiple
qualifiers of the same type, the first is used for renaming.
Use annotation type as name The annotation's type will be used for the annotation
name. E.g. if you have an annotation of type "Promoter", it will get "Promoter" as its
name by using this option.
• Advanced Retype... Choose this to edit the type of one or more annotations. The options
in the Retype dialog (figure 14.22) are:
Use this qualifier Choose the qualifier to use as the annotation type from a drop-down
list of qualifiers available in the selected annotations. Selected annotations that do
not include the selected qualifier will not be retyped. If an annotation has multiple
qualifiers of the same type, the first is used for the new type.
New type Enter an annotation type to apply or click on the arrows at the right of the
field to see a drop-down list of pre-defined annotation types.
Use annotation name as type Use the annotation name as its type. E.g. if you have an
annotation named "Promoter", it will get "Promoter" as its type by using this option.
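The renaming rules described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior (annotations without the chosen qualifier are left unchanged, and the first matching qualifier wins). The data model, a dictionary with "name", "type" and a list of (qualifier, value) pairs, is a hypothetical one chosen for the example.

```python
def rename_by_qualifier(annotations, qualifier):
    """Set each annotation's name to its first value for the given qualifier."""
    for annotation in annotations:
        values = [value for key, value in annotation["qualifiers"] if key == qualifier]
        if values:  # annotations lacking the qualifier are not renamed
            annotation["name"] = values[0]
    return annotations


def rename_by_type(annotations):
    """Set each annotation's name to its type."""
    for annotation in annotations:
        annotation["name"] = annotation["type"]
    return annotations


example = [
    {"name": "CDS 1", "type": "CDS", "qualifiers": [("gene", "HBB"), ("gene", "beta-globin")]},
    {"name": "region 2", "type": "Promoter", "qualifiers": [("note", "TATA box")]},
]
rename_by_qualifier(example, "gene")
print([annotation["name"] for annotation in example])
```

Retyping by qualifier or by name follows the same pattern with the "type" field as the target.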
Figure 14.23: The initial display of sequence info for the HUMHBB DNA sequence from the Example
data.
All the lines in the view are headings, and the corresponding text can be shown by clicking the
text. The information available depends on the origin of the sequence.
• Name. The name of the sequence which is also shown in sequence views and in the
Navigation Area.
• Metadata. The Metadata table and the detailed metadata values associated with the
sequence.
• Gb Division. Abbreviation of GenBank divisions. See section 3.3 in the GenBank release
notes for a full list of GenBank divisions.
• Modification date. Modification date from the database. This means that this date does
not reflect your own changes to the sequence. See the History view, described in section
2.5 for information about the latest changes to the sequence after it was downloaded from
the database.
• Read group. Read group identifier "ID", technology used to produce the reads "Platform",
and sample name "Sample".
• Paired Status. Unpaired or Paired sequences. For paired sequences, the Minimum and Maximum
distances as well as the Read orientation set during import are also shown.
Some of the information can be edited by clicking the blue Edit text. This means that you can
add your own information to sequences that do not derive from databases.
3D Molecule Viewer
Contents
15.1 Importing molecule structure files
15.1.1 From the Protein Data Bank
15.1.2 From your own file system
15.1.3 BLAST search against the PDB database
15.1.4 Import issues
15.2 Viewing molecular structures in 3D
15.3 Customizing the visualization
15.3.1 Visualization styles and colors
15.3.2 Project settings
15.4 Tools for linking sequence and structure
15.4.1 Show sequence associated with molecule
15.4.2 Link sequence or sequence alignment to structure
15.4.3 Transfer annotations between sequence and structure
15.5 Align Protein Structure
15.5.1 Example: alignment of calmodulin
15.5.2 The Align Protein Structure algorithm
15.6 Generate Biomolecule
Proteins are amino acid polymers that are involved in all aspects of cellular function. The structure
of a protein is defined by its particular amino acid sequence, with the amino acid sequence being
referred to as the primary protein structure. The amino acids fold up into local structural elements,
helices and sheets, also called the secondary structure of the protein. These structural elements
are then packed into globular folds, known as the tertiary structure or the three dimensional
structure.
In order to understand protein function it is often valuable to see the three dimensional structure
of the protein. This is possible when the structure of the protein has been resolved and published.
Structure files are usually deposited in the Protein Data Bank (PDB) https://www.rcsb.org/,
where the publicly available protein structure files can be searched and downloaded. The vast
majority of the protein structures have been determined by X-ray crystallography (88%) while
the rest of the structures predominantly have been obtained by Nuclear Magnetic Resonance
techniques.
CHAPTER 15. 3D MOLECULE VIEWER 301
In addition to protein structures, the PDB entries also contain structural information about
molecules that interact with the protein, such as nucleic acids, ligands, cofactors, and water.
There are also entries that contain nucleic acids and no protein structure. The 3D Molecule
Viewer in the CLC Main Workbench is an integrated viewer of such structure files.
If you have problems viewing 3D structures, please check your system matches the
requirements for 3D viewers. See section 1.3.
The 3D Molecule Viewer offers a range of tools for inspection and visualization of molecular
structures:
• Automatic sorting of molecules into categories: Proteins, Nucleic acids, Ligands, Cofactors,
Water molecules
• Browse amino acids and nucleic acids from sequence editors started from within the 3D
Molecule Viewer
Figure 15.1: Download protein structure from the Protein Data Bank. It is possible to open a
structure file directly from the output of the search by clicking the "Download and Open" button or
by double clicking directly on the relevant row.
Select the molecule structure of interest and click on the button labeled "Download and Open" -
or double click on the relevant row - in the table to open the protein structure.
Pressing the "Download and Save" button will save the molecule structure at a user defined
destination in the Navigation Area.
The button "Open at NCBI" links directly to the structure summary page at NCBI: clicking this
button will open individual NCBI pages describing each of the selected molecule structures.
In the Import dialog, select the structure(s) of interest from a data location and tick "Automatic
import" (figure 15.2). Specify where to save the imported PDB file and click Finish.
Double clicking on the imported file in the Navigation Area will open the structure as a Molecule
Project in the View Area of the CLC Main Workbench. Another option is to drag the PDB file from
the Navigation Area to the View Area. This will automatically open the protein structure as a
Molecule Project.
Figure 15.2: A PDB file can be imported using the Standard Import tool.
Figure 15.3: Select the input sequence of interest. In this example a protein sequence for ATPase
class I type 8A member 1 and an ATPase ortholog from S. pombe have been selected.
Click Next and choose program and database (figure 15.4). When a protein sequence has been
used as input, select "Program: blastp: Protein sequence and database" and "Database: Protein
Please refer to section 26.1.1 for further description of the individual parameters in the wizard
steps.
When you click on the button labeled Finish, a BLAST output is generated that shows local
sequence alignments between your input sequence and a list of matching proteins with known
structures available.
Note! The BLAST at NCBI search can take up to several minutes, especially when mRNA and
genomic sequences are used as input.
Switch to the "BLAST Table" editor view to select the desired entry (figure 15.5). If you have
performed a multi BLAST, to get access to the "BLAST Table" view, you must first double click
on each row to open the entries individually.
In this view four different options are available:
• Download and Open The sequence that has been selected in the table is downloaded and
opened in the View Area.
• Download and Save The sequence that has been selected in the table is downloaded and
saved in the Navigation Area.
• Open at NCBI The protein sequence that has been selected in the table is opened at NCBI.
• Open Structure Opens the selected structure in a Molecule Project in the View Area.
Figure 15.5: Top: The output from "BLAST at NCBI". Bottom: The "BLAST table". One of the protein
sequences has been selected. This activates the four buttons under the table. Note that the table
and the BLAST Graphics are linked: when a sequence is selected in the table, the
same sequence will be highlighted in the BLAST Graphics view.
list is linked with the molecules in the 3D view, such that selecting an entry in the list will select
the implicated atoms in the view, and zoom to put them into the center of the 3D view.
Figure 15.6: At the bottom of the Molecule Project it is possible to switch to the "Show Issues" view
by clicking on the "table-with-exclamation-mark" icon.
If you have problems viewing 3D structures, please check your system matches the
requirements for 3D viewers. See section 1.3.
Figure 15.7: 3D view of a calcium ATPase. All molecules in the PDB file are shown in the Molecule
Project. The Project Tree in the right side of the window lists the involved molecules.
Moving and rotating The molecules can be rotated by holding down the left mouse button while
moving the mouse. The right mouse button can be used to move the view.
Zooming can be done with the scroll-wheel or by holding down both left and right buttons while
moving the mouse up and down.
All molecules in the Molecule Project are listed in categories in the Project Tree. The individual
molecules or whole categories can be hidden from the view by unchecking the boxes next to them.
It is possible to bring a particular molecule or a category of molecules into focus by
double-clicking the molecule or category of interest in the Project Tree view. Another option
is to select it and use the zoom-to-fit button at the bottom of the
Project Tree view.
Troubleshooting 3D graphics errors The 3D viewer uses OpenGL graphics hardware acceleration
in order to provide the best possible experience. If you experience any graphics problems with
the 3D view, please make sure that the drivers for your graphics card are up-to-date.
If the problems persist after upgrading the graphics card drivers, it is possible to change to a
rendering mode, which is compatible with a wider range of graphics cards. To change the graphics
mode go to Edit in the menu bar, select "Preferences", click on "View", scroll down to the bottom
and find "Molecule Project 3D Editor" and uncheck the box "Use modern OpenGL rendering".
Finally, it should be noted that certain types of visualization are more demanding than others. In
particular, using multiple molecular surfaces may result in slower drawing, and even result in the
graphics card running out of available memory. Consider creating a single combined surface (by
using a selection) instead of creating surfaces for each single object. For molecules with a large
number of atoms, changing to wireframe rendering and hiding hydrogen atoms can also greatly
improve drawing speed.
category (or a mixture), by selecting the name of either the molecule or the category. Holding
down the Ctrl (Cmd on Mac) or Shift key while clicking the entry names in the Project Tree will
select multiple molecules/categories.
The six leftmost quick-style buttons below the Project Tree view give access to the molecule
visualization styles, while context menus on the buttons (accessible via right-click or left-click-
hold) give access to the color schemes available for the visualization styles. Visualization styles
and color schemes are also available from context menus directly on the selected entries in
the Project Tree. Other quick-style buttons are available for displaying hydrogen bonds between
Project Tree entries, for displaying labels in the 3D view and for creating custom atom groups.
They are all described in detail below.
Note! Whenever you wish to change the visualization styles by right-clicking the entries in the
Project Tree, please be aware that you must first click on the entry of interest, and ensure it is
highlighted in blue, before right-clicking.
• Color by Element. Classic CPK coloring based on atom type (e.g. oxygen red, carbon gray,
hydrogen white, nitrogen blue, sulfur yellow).
• Color by Temperature. For PDB files, this is based on the b-factors. For structure models
created with tools in a CLC workbench, this is based on an estimate of the local model
quality. The color scale goes from blue (0) over white (50) to red (100). The b-factors as
well as the local model quality estimate are measures of uncertainty or disorder in the atom
position; the higher the number, the higher the uncertainty.
• Color Carbons by Entry. Each entry (molecule or atom group) is assigned its own specific
color. Only carbon atoms are colored by the specific color, other atoms are colored by
element.
• Color by Entry. Each entry (molecule or atom group) is assigned its own specific color.
• Custom Carbon Color. The user selects a molecule color from a palette. Only carbon atoms
are colored by the specific color, other atoms are colored by element.
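As an illustration of the Color by Temperature scale described above (blue at 0, white at 50, red at 100), the following Python sketch maps a value to an RGB color. The manual only states the three anchor colors; the linear interpolation between them and the clamping of values outside 0 to 100 are assumptions made for this example, and the function name is hypothetical.

```python
def temperature_color(value):
    """Map a temperature value to an (R, G, B) tuple with components 0-255.

    Assumes linear interpolation from blue (0) over white (50) to red (100).
    """
    v = max(0.0, min(100.0, float(value)))  # clamp to the documented scale
    if v <= 50.0:
        t = v / 50.0  # 0 at blue, 1 at white
        return (round(255 * t), round(255 * t), 255)
    t = (v - 50.0) / 50.0  # 0 at white, 1 at red
    return (255, round(255 * (1 - t)), round(255 * (1 - t)))


for b_factor in (0, 50, 100):
    print(b_factor, temperature_color(b_factor))
```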
Backbone
For the molecules in the Proteins and Nucleic Acids categories, the backbone structure can be
visualized in a schematic rendering, highlighting the secondary structure elements for proteins
and matching base pairs for nucleic acids. The backbone visualization can be combined with any
of the atom-level visualizations.
Five color schemes are available for backbone structures:
• Color by Residue Position. Rainbow color scale going from blue over green to yellow and
red, following the residue number.
• Color by Type. For proteins, beta sheets are blue, helices red and loops/coil gray. For
nucleic acids backbone ribbons are white while the individual nucleotides are indicated in
green (T/U), red (A), yellow (G), and blue (C).
• Color by Backbone Temperature. For PDB files, this is based on the b-factors for the Cα
atoms (the central carbon atom in each amino acid). For structure models created with
tools in the workbench, this is based on an estimate of the local model quality. The color
scale goes from blue (0) over white (50) to red (100). The b-factors as well as the local
model quality estimate are measures of uncertainty or disorder in the atom position; the
higher the number, the higher the uncertainty.
Surfaces
Molecular surfaces can be visualized.
Five color schemes are available for surfaces:
• Color by Charge. Charged amino acids close to the surface will show as red (negative) or
blue (positive) areas on the surface, with a color gradient that depends on the distance of
the charged atom to the surface.
• Color by Element. Smoothed out coloring based on the classic CPK coloring of the
heteroatoms close to the surface.
• Color by Temperature. Smoothed out coloring based on the temperature values assigned
to atoms close to the surface (See the "Wireframe, Stick, Ball and stick, Space-filling/CPK"
section above).
A surface spanning multiple molecules can be visualized by creating a custom atom group that
includes all atoms from the molecules (see section 15.3.1).
It is possible to adjust the opacity of a surface by adjusting the transparency slider at the bottom
of the menu.
Notice that visual artifacts may appear when rotating a transparent surface. These artifacts
disappear as soon as the mouse is released.
Labels
Labels can be added to the molecules in the view by selecting an entry in the Project Tree and
clicking the label button at the bottom of the Project Tree view. The color of the labels can be
adjusted from the context menu by right clicking on the selected entry (which must be highlighted
in blue first) or on the label button at the bottom of the Project Tree view (see figure 15.9).
Figure 15.9: The color of the labels can be adjusted in two different ways. Either directly using the
label button by right clicking the button, or by right clicking on the molecule or category of interest
in the Project Tree.
• For proteins and nucleic acids, each residue is labeled with the PDB name and number.
• For ligands, each atom is labeled with the atom name as given in the input.
• For cofactors and water, one label is added with the name of the molecule.
• For atom groups including protein atoms, each protein residue is labeled with the PDB
name and number.
• For atom groups not including protein atoms, each atom is labeled with the atom name as
given in the input.
Hydrogen bonds
The Show Hydrogen Bond visualization style may be applied to molecules and atom group entries
in the project tree. If this style is enabled for a project tree entry, hydrogen bonds will be shown
to all other currently visible objects. The hydrogen bonds are updated dynamically: if a molecule
is toggled off, the hydrogen bonds to it will not be shown.
It is possible to customize the color of the hydrogen bonds using the context menu.
Figure 15.10: The hydrogen bond visualization setting, with custom bond color.
• Selected Atoms. Creates an atom group containing exactly the selected atoms (those
indicated by brown spheres). If an entire molecule or residue is selected, this option is not
displayed.
• Selected Residue(s)/Molecules. Creates an atom group that includes all atoms in the
selected residues (for entries in the protein and nucleic acid categories) and molecules (for
the other categories).
Figure 15.11: An atom group that has been highlighted by adding a unique visualization style.
• Nearby Atoms. Creates an atom group that contains residues (for the protein and nucleic
acid categories) and molecules (for the other categories) within 5 Å of the selected atoms.
Only atoms from currently visible Project Tree entries are considered.
• Hydrogen Bonded Atoms. Creates an atom group that contains residues (for the protein
and nucleic acid categories) and molecules (for the other categories) that have hydrogen
bonds to the selected atoms. Only atoms from currently visible Project Tree entries are
considered.
• Double click to select. Click on an atom to select it. When you double click on an atom
that belongs to a residue in a protein or in a nucleic acid chain, the entire residue will be
selected. For small molecules, the entire molecule will be selected.
• Adding atoms to a selection. Holding down Ctrl while picking atoms will add the atoms
to the selection. All atoms in a molecule or category from the Project Tree can be added
to the "Current" selection by choosing "Add to Current Selection" in the context menu.
Similarly, entire molecules can be removed from the current selection via the context menu.
• Spherical selection. Hold down the Shift key, click on an atom and drag the mouse away
from the atom. A sphere centered on the atom will then appear, and all atoms inside the
sphere that are visualized with one of the all-atom representations will be selected. The status bar
(lower right corner) will show the radius of the sphere.
• Show Sequence. Another option is to select protein or nucleic acid entries in the Project Tree,
and click the "Show Sequence" button found below the Project Tree, see section 15.4.1. A
split-view will appear with a sequence editor for each of the sequence data types (Protein,
DNA, RNA) (figure 15.12). If you then select residues in the sequence view, the backbone
atoms of the selected residues will show up as the "Current" selection in the 3D view and
the Project Tree view. Notice that the link between the 3D view and the sequence editor is
lost if either window is closed, or if the sequence is modified.
• Align to Existing Sequence. If a single protein chain is selected in the Project Tree, the
"Align to Existing Sequence" button can be clicked, see section 15.4.2. This links the
protein sequence with a sequence or sequence alignment found in the Navigation Area. A
split-view appears with a sequence alignment where the sequence of the selected protein
chain is linked to the 3D structure, and atoms can be selected in the 3D view, just as for
the "Show Sequence" option.
Figure 15.12: The protein sequence in the split view is linked with the protein structure. This means
that when a part of the protein sequence is selected, the same region in the protein structure will
be selected.
• Nearby Atoms. Creates an atom group that contains residues (for the protein and nucleic
acid categories) and molecules (for the other categories) within 5 Å of the selected entries.
Only atoms from currently visible Project Tree entries are considered.
• Hydrogen Bonded Atoms. Creates an atom group that contains residues (for the protein
and nucleic acid categories) and molecules (for the other categories) that have hydrogen
bonds to the selected entries. Only atoms from currently visible Project Tree entries are
considered.
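The Nearby Atoms rule above can be illustrated with a short Python sketch: a residue or molecule is included when at least one of its atoms lies within the cutoff of any selected atom. This is a sketch under stated assumptions, not the Workbench's implementation. The input format (pairs of an entry identifier and an x, y, z coordinate in Å), the inclusive treatment of the 5 Å boundary, and the function name are all hypothetical.

```python
import math


def nearby_entries(selected_coords, atoms, cutoff=5.0):
    """Return identifiers of residues/molecules with an atom within `cutoff` Å.

    selected_coords: iterable of (x, y, z) coordinates of the selected atoms.
    atoms: iterable of (entry_id, (x, y, z)) for the visible atoms to test.
    """
    selected = list(selected_coords)
    nearby = set()
    for entry_id, coord in atoms:
        if any(math.dist(coord, s) <= cutoff for s in selected):
            nearby.add(entry_id)
    return nearby


visible_atoms = [
    ("A:10", (3.0, 4.0, 0.0)),   # exactly 5 Å from the origin
    ("A:10", (9.0, 9.0, 9.0)),
    ("A:11", (6.0, 0.0, 0.0)),   # 6 Å away, excluded
    ("HOH:1", (0.0, 0.0, 4.9)),
]
print(sorted(nearby_entries([(0.0, 0.0, 0.0)], visible_atoms)))
```

For large structures a spatial index would be preferable to this all-pairs comparison; the simple loop is used here only to keep the rule easy to read.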
If a Binding Site Setup is present in the Project Tree (A Binding Site Setup could only be created
using the now discontinued CLC Drug Discovery Workbench), and entries from the Ligands or
Docking results categories are selected, two extra options are available under the header Create
Atom Group (Binding Site). For these options, atom groups are created considering all molecules
included in the Binding Site Setup, and thus not taking into account which Project Tree entries
are currently visible.
Zoom to fit
The "Zoom to fit" button automatically moves a region of interest into the center of the screen.
Select a molecule or category of interest in the Project Tree view, then click the "Zoom to fit"
button at the bottom of the Project Tree view (figure 15.13). Double-clicking an entry in the
Project Tree has the same effect.
Figure 15.13: The "Zoom to fit" button can be used to bring a particular molecule or category of
molecules in focus.
• Show Sequence Select molecules which have sequences associated (Protein, DNA, RNA) in
the Project Tree, and click this button. Then, a split-view will appear with a sequence editor
for each of the sequence data types (Protein, DNA, RNA). This is described in section 15.4.1.
• Align to Existing Sequence Select a protein chain in the Project Tree, and click this button.
Protein sequences and sequence alignments found in the Navigation Area can then be
linked with the protein structure. This is described in section 15.4.2.
• Transfer Annotations Select a protein chain in the Project Tree that has been linked with a
sequence using either the "Show Sequence" or "Align to Existing Sequence" options. Then
it is possible to transfer annotations between the structure and the linked sequence. This
is described in section 15.4.3.
• Align Protein Structure This will invoke the dialog for aligning protein structures based on
global alignment of whole chains or local alignment of e.g. binding sites defined by atom
groups. This is described in section 15.5.
Property viewer
The Property viewer, found in the Side Panel, lists detailed information about the atoms that the
mouse hovers over. For all atoms the following information is listed:
• Residue For proteins and nucleic acids, the name and number of the residue the atom
belongs to is listed, and the chain name is displayed in parentheses.
• Name The particular atom name, if given in input, with the element type (Carbon, Nitrogen,
Oxygen...) displayed in parentheses.
• Charge The atomic charge as given in the input file. If charges are not given in the input
file, some charged chemical groups are automatically recognized and a charge assigned.
For atoms in molecules imported from a PDB file, extra information is given:
• Temperature The B-factor assigned to the atom in the PDB file. The B-factor
is a measure of uncertainty or disorder in the atom position; the higher the number, the
higher the disorder.
• Occupancy For each atom in a PDB file, the occupancy is given. It is typically 1, but if
atoms are modeled in the PDB file, with no foundation in the raw data, the occupancy is 0.
If a residue or molecule has been resolved in multiple positions, the occupancy is between
0 and 1.
If an atom is selected, the Property view will be frozen with the details of the selected atom
shown. If then a second atom is selected (by holding down Ctrl while clicking), the distance
between the two selected atoms is shown. If a third atom is selected, the angle for the second
atom selected is shown. If a fourth atom is selected, the dihedral angle measured as the angle
between the planes formed by the three first and three last selected atoms is given.
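The three measurements can be written out from atom coordinates. The sketch below is illustrative only: the function names are hypothetical, coordinates are plain (x, y, z) tuples, and the dihedral is returned as an unsigned angle between the two planes, which is one reading of the description above. The Workbench may report a signed value.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _angle_between(u, v):
    """Unsigned angle in degrees between two vectors."""
    cos_theta = _dot(u, v) / (math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def distance(p1, p2):
    """Distance between two selected atoms."""
    return math.dist(p1, p2)


def angle(p1, p2, p3):
    """Angle at the second selected atom, between the directions to atoms 1 and 3."""
    return _angle_between(_sub(p1, p2), _sub(p3, p2))


def dihedral(p1, p2, p3, p4):
    """Angle between the plane through atoms 1-2-3 and the plane through atoms 2-3-4."""
    n1 = _cross(_sub(p2, p1), _sub(p3, p2))  # normal of the first plane
    n2 = _cross(_sub(p3, p2), _sub(p4, p3))  # normal of the second plane
    return _angle_between(n1, n2)
```

With this convention, four atoms in a cis arrangement give a dihedral of 0 degrees and a trans arrangement gives 180 degrees.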
If a molecule is selected in the Project Tree, the Property view shows information about this
molecule. Two measures are always shown:
Visualization settings
Under "Visualization" five options exist:
Figure 15.14: Selecting two, three, or four atoms will display the distance, angle, or dihedral angle,
respectively.
• Hydrogens Hydrogen atoms can be shown (Show all hydrogens), hidden (Hide all hydrogens)
or partially shown (Show only polar hydrogens).
• Fog "Fog" is added to give a sense of depth in the view. The strength of the fog can be
adjusted or it can be disabled.
• Clipping plane This option makes it possible to add an imaginary plane at a specified
distance along the camera's line of sight. Only objects behind this plane will be drawn. It is
possible to clip only surfaces, or to clip surfaces together with proteins and nucleic acids.
Small molecules, like ligands and water molecules, are never clipped.
• 3D projection The view is opened up towards the viewer, with a "Perspective" 3D projection.
The field of view of the perspective can be adjusted, or the perspective can be disabled by
selecting an orthographic 3D projection.
• Coloring The background color can be selected from a color palette by clicking on the
colored box.
Snapshots of the molecule visualization To save the current view as a picture, right-click in the
View Area and select "File" and "Export Graphics". Another way to save an image is by pressing
the "Graphics" button in the Workbench toolbar. Next, select the location where you wish
to save the image, select file format (PNG, JPEG, or TIFF), and provide a name, if you wish to use
another name than the default name.
You can also save the current view directly on data with a custom name, so that it can later be
applied (see section 4.6).
Figure 15.15: Protein chain sequences and DNA sequences are shown in separate views.
Figure 15.16: Select a single protein chain in the Project Tree and invoke "Align to Existing
Sequence".
When the link is established, selections on the linked sequence in the sequence editor will
create atom selections in the 3D view, and it is possible to transfer annotations between the
linked sequence and the 3D protein chain (see section 15.4.3). Note that the link will be broken
if either the sequence or the 3D protein chain is modified.
Two tips if the link is to a sequence in an alignment:
1. Read about how to change the layout of sequence alignments in section 16.2.
2. Only annotations present on the sequence linked to the 3D view can be transferred
to atom groups on the structure. To transfer sequence annotations from other sequences
in the alignment, first copy the annotations to the sequence in the alignment that is linked
to the structure (see figure 15.19 and section 16.3).
Figure 15.17: Select a single protein chain in the Project Tree and invoke "Transfer Annotations".
Figure 15.18: The Transfer Annotations dialog allows you to select annotations listed in the two
tables, and copy them from structure to sequence or vice versa.
Figure 15.19: Copy annotations from sequences in the alignment to the sequence linked to the 3D
view.
• Select reference (protein chain or atom group) This drop-down menu shows all the protein
chains and residue-containing atom groups in the current Molecule Project. If an atom
group is selected, the structural alignment will be optimized in that area. The 'All chains
from Molecule Project' option will create a global alignment to all protein chains in the
project, fitting e.g. a dimer to a dimer.
• Molecule Projects with molecules to be aligned One or more Molecule Projects containing
protein chains may be selected.
• Output options The default output is a single Molecule Project containing all the input
projects rotated onto the coordinate system of the reference. Several alignment statistics,
including the RMSD, TM-score, and sequence identity, are added to the History of the
output Molecule Project. Additionally, a sequence alignment of the aligned structures
may be output, with the sequences linked to the 3D structure view.
Initial global alignment The 1A29 project is opened and the Align Protein Structure dialog is
filled out as in figure 15.20. Selecting "All chains from 1A29" tells the aligner to make the best
possible global alignment, favoring no particular region. The output of the alignment is shown
in figure 15.21. The blue chain is from 1A29, the brown chain is the corresponding calmodulin
chain from 4G28 (a calmodulin-binding chain from the 4G28 file has been hidden from the view).
Because calmodulin is so flexible, it is not possible to align both of its domains (enclosed in
black boxes) at the same time. A good global alignment would require the brown protein to be
translated in one direction to match the N-terminal domain, and in the other direction to match
the C-terminal domain (see black arrows).
Figure 15.21: Global alignment of two calmodulin structures (blue and brown). The two domains
of calmodulin (shown within black boxes) can undergo large changes in relative orientation. In
this case, the different orientation of the domains in the blue and brown structures makes a good
global alignment impossible: the movement required to align the brown structure onto the blue
is shown by arrows -- as the arrows point in opposite directions, improving the alignment of one
domain comes at the cost of worsening the alignment of the other.
Focusing the alignment on the N-terminal domain To align only the N-terminal domain, we
return to the 1A29 project and select the Show Sequence action from beneath the Project
Tree. We highlight the first 62 residues, then convert them into an atom group by right-clicking
on the "Current" selection in the Project Tree and choosing "Create Group from Selection"
(figure 15.22). Using the new atom group as the reference in the alignment dialog leads to
the alignment shown in figure 15.23. In addition to the original input proteins, the output now
includes two Atom Groups, which contain the atoms on which the alignment was focused. The
History of the output Molecule Project shows that the alignment has 0.9 Å RMSD over the 62
residues.
Aligning a binding site Two bound calcium atoms, one from each calmodulin structure, are
shown in the black box of figure 15.23. We now wish to make an alignment that is as good as
possible around these atoms so as to compare the binding modes. We return to the 1A29 project,
right-click the calcium atom from the cofactors list in the Project Tree and select "Create Nearby
Atoms Group". Using the new atom group as the reference in the alignment dialog leads to the
alignment shown in figure 15.24.
Figure 15.22: Creation of an atom group containing the N-terminal domain of calmodulin.
Figure 15.23: Alignment of the same two calmodulin proteins as in figure 15.21, but this time with
a focus on the N-terminal domain. The blue and brown structures are now well-superimposed in
the N-terminal region. The black box encloses two calcium atoms that are bound to the structures.
Figure 15.24: Alignment of the same two calmodulin domains as in figure 15.21, but this time with
a focus on the calcium atom within the black box of figure 15.23. The calcium atoms are less than
1 Å apart -- compatible with thermal motion encoded in the atoms' temperature factors.
\[
\text{TM-score} = \frac{1}{L} \sum_{i} \frac{1}{1 + \left( \frac{d_i}{d(L)} \right)^{2}}
\]
where i runs over the aligned pairs of residues, di is the distance between the ith such pair,
and d(L) is a normalization term that approximates the average distance between two randomly
chosen points in a globular protein of length L [Zhang and Skolnick, 2004]. A perfect alignment
has a TM-score of 1.0, and two proteins with a TM-score >0.5 are often said to show structural
homology [Xu and Zhang, 2010].
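The TM-score definition can be written out as a short calculation. The sketch below takes the distances between aligned residue pairs and the protein length L. The normalization term uses the expression published by Zhang and Skolnick (2004), d(L) = 1.24 (L - 15)^(1/3) - 1.8 Å; that the Workbench uses exactly this expression is an assumption, and the function names are hypothetical.

```python
def d_norm(length):
    """Normalization distance d(L) in Å, following Zhang and Skolnick (2004).

    The expression is not meaningful for very short chains, so this sketch
    simply rejects them rather than guessing a fallback value.
    """
    if length <= 21:
        raise ValueError("this sketch only covers chains longer than 21 residues")
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, length):
    """TM-score for aligned residue pairs.

    distances: distance d_i in Å for each aligned pair of residues
    length:    L, the length of the protein used for normalization
    """
    d = d_norm(length)
    return sum(1.0 / (1.0 + (d_i / d) ** 2) for d_i in distances) / length
```

Each aligned pair contributes a value between 0 and 1, and unaligned residues contribute nothing, so a score of 1.0 is only reached when all L residues are aligned with zero distance.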
The Align Protein Structure Algorithm attempts to find the structure alignment with the highest
TM-score. This problem reduces to finding a sequence alignment that pairs residues in a way that
results in a high TM-score. Several sequence alignments are tried including an alignment with
the BLOSUM62 matrix, an alignment of secondary structure elements, and iterative refinements
of these alignments.
The Align Protein Structure Algorithm is also capable of aligning entire protein complexes. To do
this, it must determine the correct pairing of each chain in one complex with a chain in the other.
This set of chain pairings is determined by the following procedure:
1. Make structure alignments between every chain in one complex and every chain in the
other. Discard pairs of chains that have a TM-score below 0.4.
2. Find all pairs of structure alignments that are consistent with each other, i.e. are achieved
by approximately the same rotation.
3. Use a heuristic to combine consistent pairs of structure alignments into a single alignment.
The heuristic used in the last step is similar to that of MM-align [Mukherjee and Zhang, 2009],
whereas the first two steps lead to both a considerable speed up and increased accuracy. The
alignment of two 30S ribosome subunits, each with 20 protein chains, can be achieved in less
than a minute (PDB codes 2QBD and 1FJG).
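The first two steps of the chain pairing procedure can be sketched as follows. This is an illustration under stated assumptions and not the Workbench implementation: the input format, the function names, the 10 degree tolerance used for "approximately the same rotation", and the rule that a chain may appear in only one pairing are all hypothetical. Translations are ignored for brevity.

```python
import math
from itertools import combinations


def rotation_angle_deg(r1, r2):
    """Angle in degrees of the rotation taking r1 to r2 (both 3x3 rotation matrices)."""
    # trace(r1^T r2) equals the element-wise product summed over all entries.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def consistent_chain_pairs(alignments, tm_cutoff=0.4, max_angle_deg=10.0):
    """Steps 1 and 2: filter chain pairs by TM-score, then find consistent pairs.

    alignments: {(chain_in_complex_1, chain_in_complex_2): (tm_score, rotation)}
    """
    # Step 1: discard chain pairs with a TM-score below the cutoff.
    kept = {pair: rot for pair, (tm, rot) in alignments.items() if tm >= tm_cutoff}

    # Step 2: keep pairs of alignments achieved by approximately the same rotation.
    consistent = []
    for p, q in combinations(sorted(kept), 2):
        if p[0] == q[0] or p[1] == q[1]:
            continue  # assumption: each chain is used in at most one pairing
        if rotation_angle_deg(kept[p], kept[q]) <= max_angle_deg:
            consistent.append((p, q))
    return consistent
```

The output of this sketch would be the input to the heuristic in step 3, which is not reproduced here.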
When a PDB file with biomolecule information available has been either downloaded directly to
the workbench using the Search for PDB Structures at NCBI or imported using Import Molecules
with 3D Coordinates, the information can be used to generate biomolecule structures in CLC Main
Workbench.
The "Generate Biomolecule" dialog is invoked from the Side Panel of a Molecule Project
(figure 15.25). The button is found in the Structure tools section below the Project Tree.
Figure 15.25: The Generate Biomolecule dialog lists all possibilities for biomolecules, as given
in the PDB files imported to the Molecule Project. In this case, only one biomolecule option is
available. The Generate Biomolecule button that invokes the dialog can be seen in the bottom right
corner of the picture.
There can be more than one biomolecule description available from the imported PDB files. The
biomolecule definitions have either been assigned by the crystallographer solving the protein
structure (Author assigned = "Yes") or suggested by a software prediction tool (Author assigned
= "No"). The third column lists which protein chains are involved in the biomolecule, and how
many copies will be made.
Select the preferred biomolecule definition and click OK.
A new Molecule Project will open containing the molecules involved in the selected biomolecule
(example in figure 15.26). If required by the biomolecule definition, copies are made of
protein chains and other molecules, and the copies are positioned according to the biomolecule
information given in the PDB file. The copies will in that case have "s1", "s2", "s3" etc. at the
end of the molecule names seen in the Project Tree.
If the proteins in the Molecule Project are already present in their biomolecule form, the message
"The biological unit is already shown" is displayed when the "Generate Biomolecule" button is
clicked.
If the PDB files imported or downloaded to a Molecule Project did not hold biomolecule information,
the message "No biological unit is associated with this Molecule Project" is shown when the
Generate Biomolecule button is clicked.
Figure 15.26: One of the biomolecules that can be generated after downloading the PDB 2R9R to
CLC Main Workbench. It is a voltage gated potassium channel.
Chapter 16
Sequence alignment
Contents
16.1 Create an alignment
16.1.1 Gap costs
16.1.2 Fast or accurate alignment algorithm
16.1.3 Aligning alignments
16.1.4 Fixpoints
16.2 View alignments
16.2.1 Bioinformatics explained: Sequence logo
16.3 Edit alignments
16.3.1 Realignment
16.4 Join alignments
16.5 Pairwise comparison
16.5.1 The pairwise comparison table
16.5.2 Bioinformatics explained: Multiple alignments
CLC Main Workbench can align nucleotides and proteins using a progressive alignment algorithm
(see section 16.5.2).
This chapter describes how to use the program to align sequences, and covers alignment
algorithms in more general terms.
CHAPTER 16. SEQUENCE ALIGNMENT 328
• Gap extension cost. The price for every extension past the initial gap.
If you expect a lot of small gaps in your alignment, the Gap open cost should equal the Gap
extension cost. On the other hand, if you expect few but large gaps, the Gap open cost should
be set significantly higher than the Gap extension cost.
However, for most alignments it is a good idea to make the Gap open cost quite a bit higher
than the Gap extension cost. The default values are 10.0 and 1.0 for the two parameters,
respectively.
• End gap cost. The price of gaps at the beginning or the end of the alignment. One of the
advantages of the CLC Main Workbench alignment method is that it provides flexibility in
the treatment of gaps at the ends of the sequences. There are three possibilities:
- Free end gaps. Any number of gaps can be inserted in the ends of the sequences
without any cost.
- Cheap end gaps. All end gaps are treated as gap extensions and any gaps past 10
are free.
- End gaps as any other. Gaps at the ends of sequences are treated like gaps in any
other place in the sequences.
When aligning a long sequence with a short partial sequence, it is ideal to use free end gaps,
since this will be the best approximation to the situation. The many gaps inserted at the ends
are not due to evolutionary events, but rather to partial data.
Many homologous proteins have quite different ends, often with large insertions or deletions. This
confuses alignment algorithms, but using the Cheap end gaps option, large gaps will generally
be tolerated at the sequence ends, improving the overall alignment. This is the default setting of
the algorithm.
Finally, treating end gaps like any other gaps is the best option when you know that there are no
biologically distinct effects at the ends of the sequences.
Figures 16.3 and 16.4 illustrate the differences between the different gap scores at the sequence
ends.
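The interplay between the gap cost parameters can be made concrete with a small calculation. The sketch below prices a single contiguous gap under the three end gap settings. It assumes that a gap of length k costs one Gap open cost plus k - 1 Gap extension costs, which is one reading of the parameter descriptions above; the exact scoring used by the Workbench may differ. The function name and the mode strings are hypothetical.

```python
def gap_cost(length, at_sequence_end=False, end_gap_mode="cheap",
             gap_open=10.0, gap_extension=1.0):
    """Cost of one contiguous gap of `length` positions (length >= 1).

    end_gap_mode is one of "free", "cheap" or "as_any_other" and only
    matters when the gap lies at the beginning or end of the alignment.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    if at_sequence_end:
        if end_gap_mode == "free":
            return 0.0
        if end_gap_mode == "cheap":
            # Every position is priced as an extension; positions past 10 are free.
            return gap_extension * min(length, 10)
        if end_gap_mode != "as_any_other":
            raise ValueError("unknown end gap mode: " + end_gap_mode)
    # Internal gaps, and end gaps treated as any other: one opening plus extensions.
    return gap_open + gap_extension * (length - 1)
```

With the default values of 10.0 and 1.0, a single internal gap of five positions costs 14.0, whereas five separate single-position gaps cost 50.0. This is why a Gap open cost well above the Gap extension cost favors few, large gaps.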
Figure 16.3: The first 50 positions of two different alignments of seven calpastatin sequences. The
top alignment is made with cheap end gaps, while the bottom alignment is made with end gaps
having the same price as any other gaps. In this case it seems that the latter scoring scheme gives
the best result.
• Fast (less accurate). Use an optimized alignment algorithm that is very fast. This is
particularly useful for data sets with very long sequences.
• Slow (very accurate). The recommended choice unless the processing time is too long.
Both algorithms use progressive alignment. The faster algorithm builds the initial tree by doing
more approximate pairwise alignments than the slower option.
Figure 16.4: The alignment of the coding sequence of bovine myoglobin with the full mRNA of
human gamma globin. The top alignment is made with free end gaps, while the bottom alignment
is made with end gaps treated as any other. The yellow annotation is the coding sequence in both
sequences. It is evident that free end gaps are ideal in this situation as the start codons are aligned
correctly in the top alignment. Treating end gaps as any other gaps in the case of aligning distant
homologs where one sequence is partial leads to a spreading out of the short sequence as in the
bottom alignment.
• Leave this box unchecked when aligning additional sequences to the original alignment.
Equal sized gaps may be inserted in all sequences of the original alignment to accommodate
the alignment of the new sequences (figure 16.5), but apart from this, positions in the
original alignment are fixed.
• Check this box to realign the sequences in the alignment provided as input. This can be
useful, for example, if you wish to realign using different gap costs than used originally.
16.1.4 Fixpoints
To force particular regions of an alignment to be aligned to each other, there are two steps:
2. Check the "Use fixpoints" option when launching the Create Alignment tool.
Figure 16.5: The original alignment is shown at the top. That alignment and a single additional
sequence, with four Xs added for illustrative purposes, were used as input to Create Alignment.
The "Redo alignment" option was left unchecked. The resulting alignment is shown at the bottom.
Gaps have been added, compared to the original alignment, to accommodate the new sequence.
All other positions are aligned as they were in the original alignment.
This will add an annotation of type "Alignment fixpoint", with name "Fixpoint" to the sequence
(figure 16.6).
Regions with fixpoint annotations with the same name are aligned to each other. Where there are
multiple fixpoints of the same name on sequences, the first fixpoints on each sequence will be
aligned to each other, the second on each sequence will be aligned to each other, and so on.
To adjust the name of a fixpoint annotation:
Right-click the Fixpoint annotation | Edit Annotation ( ) | Type the name in the
'Name' field
An example where assigning different names to fixpoints is useful: Given three sequences, A,
B and C, where A and B each have one copy of a domain while sequence C has two copies
of the domain, you can force sequence A to align to the first copy of the domain in sequence
C and sequence B to align to the second copy of the domain in sequence C by naming the
fixpoints accordingly. E.g. if the fixpoints in sequence C were named 'fp1' and 'fp2', the fixpoint
in sequence A was named 'fp1' and the fixpoint in sequence B was named 'fp2', then when
these sequences are aligned using fixpoints, the fixpoint in sequence A would be aligned to the
first copy of the domain in sequence C, while the fixpoint in sequence B would be aligned to the
second copy of the domain in sequence C.
The result of an alignment using fixpoints is shown in figure 16.7.
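The matching rule for fixpoints (same name, then first with first, second with second, and so on) can be sketched as a grouping step. The input format and the function name `fixpoint_groups` are assumptions made for illustration only.

```python
from collections import defaultdict


def fixpoint_groups(fixpoints):
    """Group fixpoints that would be aligned to each other.

    fixpoints: {sequence_name: [(fixpoint_name, start_position), ...]}
    Returns {(fixpoint_name, occurrence_index): [(sequence_name, start_position), ...]}
    """
    groups = defaultdict(list)
    for sequence_name, annotations in fixpoints.items():
        seen = defaultdict(int)  # occurrences of each name seen so far on this sequence
        # Walk along the sequence so that "first" and "second" follow position order.
        for name, start in sorted(annotations, key=lambda a: a[1]):
            groups[(name, seen[name])].append((sequence_name, start))
            seen[name] += 1
    return dict(groups)
```

For the three-sequence example above, sequences A and C end up together in the group for 'fp1', and sequences B and C end up together in the group for 'fp2'.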
Figure 16.6: Select a region and right-click on it to see the option to set a fixpoint. The second
sequence in the list already has a Fixpoint annotation.
Figure 16.7: Fixpoints have been added to 2 sequences in an alignment, where the first 3
sequences are very similar to each other and the last 3 sequences are very similar to each other
(top). After realigning using just these 2 fixpoints (bottom), the alignment now shows clearly the 2
groups of sequences.
Sequence layout
In the side panel tab Sequence layout the option Alignments on top supports moving the aligned
sequences relative to other elements shown in the alignment view. When checked, the alignment
is shown at the top of the view; when unchecked, the alignment is shown underneath other
included summary information such as Consensus, Conservation and Sequence logo.
Nucleotide info
In the side panel tab Nucleotide info under Translation, there is an extra checkbox: Relative to
top sequence. Checking this box will make the reading frames for the translation align with the
top sequence so that you can compare the effect of nucleotide differences on the protein level.
Alignment info
The entire side panel tab Alignment info is specific to alignments. Each of the options in
Alignment info relates to each column in the alignment.
The data points for graph representations can be exported (see section 8.3).
Consensus Shows a consensus sequence at the bottom of the alignment. The consensus
sequence is based on every single position in the alignment and reflects an artificial sequence
which resembles the sequence information of the alignment, but only as one single sequence.
If all sequences of the alignment are 100% identical, the consensus sequence will be identical to
all sequences found in the alignment. If the sequences of the alignment differ, the consensus
sequence will reflect the most common residue at each position. Parameters for adjusting
the consensus sequences are described below.
• Limit This option determines how conserved the sequences must be in order to agree on
a consensus. Here you can also choose IUPAC which will display the ambiguity code when
there are differences between the sequences. For example, an alignment with A and a G at
the same position will display an R in the consensus line if the IUPAC option is selected.
The IUPAC codes can be found in sections E and F. Please note that the IUPAC codes are
only available for nucleotide alignments.
• No gaps Checking this option will not show gaps in the consensus.
• Ambiguous symbol Select how ambiguities should be displayed in the consensus line (as
N, ?, *, . or -). This option has no effect if IUPAC is selected in the Limit list above.
The Consensus Sequence can be opened in a new view by right-clicking the Consensus
Sequence and clicking Open Consensus in New View.
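How the Limit, IUPAC and Ambiguous symbol settings interact can be sketched for a nucleotide alignment. The sketch is illustrative only: the function names are hypothetical, gaps are ignored when counting, and a residue is accepted when its share of the column is at least the limit. The precise rules used by the Workbench are not stated here, so these are assumptions.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus_symbol(column, limit=0.5, use_iupac=False, ambiguous="N"):
    """Consensus symbol for one alignment column of unambiguous nucleotides."""
    residues = [c.upper() for c in column if c.upper() in "ACGT"]
    if not residues:
        return "-"  # the column holds only gaps
    if use_iupac:
        return IUPAC[frozenset(residues)]
    symbol, count = Counter(residues).most_common(1)[0]
    # Below the limit, the sequences do not agree on a consensus.
    return symbol if count / len(residues) >= limit else ambiguous


def consensus(alignment, **settings):
    """Consensus sequence for a list of equally long aligned sequences."""
    return "".join(consensus_symbol(col, **settings) for col in zip(*alignment))
```

As in the example above, a column holding an A and a G yields R when the IUPAC option is used, and the Ambiguous symbol setting has no effect in that case.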
Conservation Displays the level of conservation at each position in the alignment. The height
of the bar, or the gradient of the color, reflects how conserved that particular position is in the
alignment. If a position is 100% conserved, the bar is shown at full height and colored in the
color specified at the right side of the gradient slider.
• Foreground color Colors the letters using a gradient, where the right side color is used
for highly conserved positions and the left side color is used for positions that are less
conserved.
• Background color. Sets a background color of the residues using a gradient in the same
way as described above.
• Graph Displays the conservation level as a graph at the bottom of the alignment. The bar
(default view) shows the conservation of all sequence positions. The height of the graph
reflects how conserved that particular position is in the alignment. If a position is 100%
conserved, the graph is shown at full height.
Gap fraction The fraction of the sequences in the alignment that have a gap at each position.
The gap fraction is only relevant if there are gaps in the alignment.
• Foreground color Colors the letter using a gradient, where the left side color is used if there
are relatively few gaps, and the right side color is used if there are relatively many gaps.
• Background color Sets a background color of the residues using a gradient in the same
way as described above.
• Graph Displays the gap fraction as a graph at the bottom of the alignment.
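The gap fraction is a simple per-column quantity. A minimal sketch follows, assuming that gaps are written as "-" and that all sequences in the alignment have the same length; the function name is hypothetical.

```python
def gap_fractions(alignment, gap="-"):
    """Fraction of sequences that have a gap, for each alignment column."""
    n_sequences = len(alignment)
    n_columns = len(alignment[0])
    return [
        sum(sequence[i] == gap for sequence in alignment) / n_sequences
        for i in range(n_columns)
    ]
```

For example, `gap_fractions(["A-C", "A-C", "AGC", "A--"])` gives 0.0 for the first column, 0.75 for the second and 0.25 for the third.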
Sequence logo A sequence logo displays the frequencies of residues at each position in an
alignment. This is presented as the relative heights of letters, along with the degree of sequence
conservation as the total height of a stack of letters, measured in bits of information. The vertical
scale is in bits, with a maximum of 2 bits for nucleotides and approximately 4.32 bits for amino
acid residues. See section 16.2.1 for more details.
• Foreground color Color the residues using a gradient according to the information content
of the alignment column. Low values indicate columns with high variability whereas high
values indicate columns with similar residues.
• Background color Sets a background color of the residues using a gradient in the same
way as described above.
Positional stats
The side panel tab Positional stats provides site specific information about the alignment. Hover
the mouse cursor over a position in the alignment or make a selection to populate the tab with
information (figure 16.8).
Figure 16.8: Contents of the Positional stats tab when a single sequence is selected (Left) and
when three sequences are selected (Right). Note that the side panel can be dragged into the
alignment view.
When one position is selected, the information provided is calculated from all the sequences at
the position. If more sequences at the same position are selected, the information is calculated
for the selected sequences only.
The following information is provided:
• Pairwise % identity Average percent identity. All pairs of bases at the same position
are compared. The number of identical pairs is counted and divided by the total number
of pairs. The count of ambiguity characters is scaled to the number of bases they can
represent, for example a G compared to an R (A or G) is given the value 0.5.
Example calculation for an alignment with the nucleotides A, A and G in the tested position:
There are three pairwise comparisons, A to A = 1, A to G = 0, and A to G = 0. The pairwise
% identity is then 1/3, or approximately 33.3%.
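The calculation can be sketched for a single alignment position. The text only gives one ambiguity example (G compared to R scores 0.5); the sketch generalizes this by dividing the number of shared bases by the product of the number of bases each symbol can represent. That generalization, and the function names, are assumptions.

```python
from itertools import combinations

# Bases represented by each nucleotide symbol (standard IUPAC codes).
BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_value(a, b):
    """Identity value for one pair of symbols, scaled for ambiguity codes."""
    set_a, set_b = set(BASES[a]), set(BASES[b])
    return len(set_a & set_b) / (len(set_a) * len(set_b))


def pairwise_identity(column):
    """Average identity over all pairs of symbols at one alignment position."""
    pairs = list(combinations(column, 2))
    if not pairs:
        raise ValueError("at least two sequences are needed")
    return sum(match_value(a, b) for a, b in pairs) / len(pairs)
```

Calling `pairwise_identity("AAG")` reproduces the worked example: three comparisons with values 1, 0 and 0, giving 1/3.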
Figure 16.9: Ungapped sequence alignment of eleven E. coli sequences defining a start codon.
The start codons start at position 1. Below the alignment is shown the corresponding sequence
logo. As seen, a GTG start codon and the usual ATG start codons are present in the alignment. This
can also be visualized in the logo at position 1.
\[
R_{seq} = S_{max} - S_{obs} = \log_2 N - \left( - \sum_{n=1}^{N} p_n \log_2 p_n \right)
\]
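The Rseq expression can be evaluated for one alignment column as follows. In this sketch, N is the alphabet size (4 for nucleotides, 20 for amino acids) and p_n is taken to be the observed frequency of symbol n in the column. Gaps and small-sample corrections are not handled, and the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(column, alphabet_size):
    """R_seq in bits for one alignment column containing residues only."""
    counts = Counter(column)
    total = sum(counts.values())
    # S_obs: Shannon entropy of the observed symbol frequencies.
    s_obs = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # S_max: entropy when all symbols of the alphabet are equally likely.
    s_max = math.log2(alphabet_size)
    return s_max - s_obs
```

A fully conserved nucleotide column gives 2 bits and a fully conserved amino acid column gives log2(20), approximately 4.32 bits, matching the maximum values of the vertical scale. A nucleotide column with all four bases equally represented gives 0 bits.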
select one or more gaps or residues in the alignment | drag the selection to move
This can be done for a single sequence, or for multiple sequences by making a
selection covering more than one sequence. When you have made the selection, the mouse
pointer turns into a horizontal arrow indicating that the selection can be moved (see figure 16.10).
Note! Residues can only be moved when they are next to a gap.
Figure 16.10: Moving a part of an alignment. Notice the change of mouse pointer to a horizontal
arrow.
Insert gaps The placement of gaps in the alignment can be changed by modifying the parameters
when creating the alignment. However, gaps can also be added manually after the alignment is
created.
To insert extra gaps:
select a part of the alignment | right-click the selection | Add gaps before/after
If you have made a selection covering five residues for example, a gap of five will be inserted.
In this way you can easily control the number of gaps to insert. Gaps will be inserted in the
sequences that you selected. If you make a selection in two sequences in an alignment, gaps will
be inserted into these two sequences. This means that these two sequences will be displaced
compared to the other sequences in the alignment.
Delete residues and gaps Residues or gaps can be deleted for individual sequences or for the
whole alignment. For individual sequences:
select the part of the sequence you want to delete | right-click the selection | Edit
Selection ( ) | Delete the text in the dialog | Replace
The selection shown in the dialog will be replaced by the text you enter. If you delete the text,
the selection will be replaced by an empty text, i.e. deleted.
In order to delete entire columns:
manually select the columns to delete | right-click the selection | click 'Delete
Selection'
This will display a dialog listing all the sequences in the alignment. Next to each sequence is a
checkbox which is used for selecting which sequences the annotation should be copied to. Click
Copy to copy the annotation.
If you wish to copy all annotations on the sequence, click Copy All Annotations to other
Sequences.
Copied/transferred annotations will contain the same qualifier text as the original, i.e., the text
is not updated. As an example, if the annotation contains 'translation' as qualifier text, this
translation will be copied to the new sequence and will thus reflect the translation of the original
sequence, and not the new sequence which may differ.
Move sequences up and down Sequences can be moved up and down in the alignment:
drag the name of the sequence up or down
When you move the mouse pointer over the label, the pointer will turn into a vertical arrow
indicating that the sequence can be moved.
The sequences can also be sorted automatically to let you save time moving the sequences
around. To sort the sequences alphabetically:
Right-click the name of a sequence | Sort Sequences Alphabetically
If you change the Sequence name (in the Sequence Layout view preferences), you will have to
ask the program to sort the sequences again.
If you have one particular sequence that you would like to use as a reference sequence, it can be
useful to move this to the top. This can be done manually, but it can also be done automatically:
Right-click the name of a sequence | Move Sequence to Top
The sequences can also be sorted by similarity, grouping similar sequences together:
Right-click the name of a sequence | Sort Sequences by Similarity
Delete, rename and add sequences Sequences can be removed from the alignment by right-
clicking the label of a sequence:
right-click label | Delete Sequence
If you wish to delete several sequences, you can check all the sequences, right-click and choose
Delete Marked Sequences. To show the checkboxes, you first have to click the Show Selection
Boxes in the Side Panel.
A sequence can also be renamed:
right-click label | Rename Sequence
This will show a dialog, letting you rename the sequence. This will not affect the sequence that
the alignment is based on.
Extra sequences can be added to the alignment by creating a new alignment where you select
the current alignment and the extra sequences (see section 16.1).
The same procedure can be used for joining two alignments.
CHAPTER 16. SEQUENCE ALIGNMENT 340
16.3.1 Realignment
This section describes realigning parts of an existing alignment. To realign an entire alignment,
consider using the "Redo alignment" option of the Create Alignment tool, described in
section 16.1.3.
Examples where realigning part of an alignment can be helpful include:
• Adjusting the number of gaps If a region has more gaps than is useful, select the region
of interest and realign using a higher gap cost.
• Combine with fixpoints When you have an alignment where two residues are not aligned
although they should have been, you can set an alignment fixpoint on each of those
residues, and then realign the section of interest using those fixpoints, as described in
section 16.1.4. This should result in the two residues being aligned, and everything in the
selected region around them being adjusted to accommodate that change.
Selecting a region
Click and drag to select the regions of interest. For small regions in a small number of sequences,
this may be easiest while zoomed in fully, such that each residue is visible. For realigning entire
sequences, zooming out fully may be helpful.
As selection involves clicking and dragging the mouse, all regions of interest must be contiguous.
That is, you must be able to drag over the relevant regions in a single motion. This may mean
gathering particular sequences into a block. There are two ways to achieve this:
1. Click on the name of an individual sequence and drag it to the desired location in the
alignment. Do this with each relevant sequence until all those of interest are placed as
desired.
2. Check the option "Show selection boxes" in the Alignment settings section of the side
panel settings (figure 16.11). Click in the checkbox next to the names of the sequences
you wish to select. Then right-click on the name of one of the sequences and choose the
option "Sort Sequences by Marked Status". This will bring all selected sequences to the
top of the alignment.
If you have many sequences to select, it can be easiest to select the few that are not
of interest, and then invert the selection by right-clicking on any of the checkboxes and
choosing the option "Invert All Marks".
You can then easily click-and-drag your selection of sequences (this is made easier if you select
the "No wrap" setting in the right-hand side panel). By right-clicking on the selected sequences
(not on their names, but on the sequences themselves as seen in figure 16.12), you can choose
the option "Open selection in a new view", with the ability to run any relevant tool on that
sub-alignment.
Figure 16.12: Open the selected sequences in a new window to realign them.
If you have selected some alignments before launching the tool, they will be pre-selected in the
Selected Elements window of the dialog. Use the arrows to add or remove alignments from the
selected elements. In this example seven alignments are selected. Each alignment represents
one gene that has been sequenced from five different bacterial isolates from the genus Neisseria.
Clicking Next opens the dialog shown in figure 16.15.
To adjust the order of concatenation, click the name of one of the alignments, and move it up or
down using the arrow buttons.
The result is seen in the lower part of figure 16.16.
Figure 16.16: The upper part of the figure shows two of the seven alignments for the genes "abcZ"
and "aroE" respectively. Each alignment consists of sequences from one gene from five different
isolates. The lower part of the figure shows the result of "Join Alignments". Seven genes have been
joined to an artificial gene fusion, which can be useful for construction of phylogenetic trees in
cases where only fractions of a genome are available. Joining of the alignments results in one row
for each isolate consisting of seven fused genes. The number of fused gene sequences corresponds
to the number of uniquely named sequences in the joined alignments.
How alignments are joined Alignments are joined by considering the sequence names in the
individual alignments. If two sequences from different alignments have identical names, they are
considered to have the same origin and are thus joined. Consider the joining of the alignments
shown in figure 16.16 "Alignment of isolates_abcZ", "Alignment of isolates_aroE", "Alignment of
isolates_adk" etc. If a sequence with the same name is found in the different alignments (in this
case the name of the isolates: Isolate 1, Isolate 2, Isolate 3, Isolate 4, and Isolate 5), a joined
alignment will exist for each sequence name. In the joined alignment the selected alignments
will be fused with each other in the order they were selected (in this case the seven different
genes from the five bacterial isolates). Note that annotations have been added to each individual
sequence before aligning the isolates for one gene at a time in order to make it clear which
sequences were fused to each other.
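The name-matching rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the tool's implementation: alignments are represented as plain dictionaries from sequence name to aligned string, the function name join_alignments is hypothetical, and padding a name that is missing from one alignment with gaps is an assumption made only for this sketch.

```python
def join_alignments(alignments):
    """Concatenate alignments row by row, matching rows on sequence name.

    `alignments` is a list of dicts {sequence name: aligned string}, given in
    the order in which they should be fused. Padding a missing name with
    gaps is an assumption of this sketch.
    """
    # Collect every sequence name once, in order of first appearance.
    names = []
    for aln in alignments:
        for name in aln:
            if name not in names:
                names.append(name)

    joined = {name: "" for name in names}
    for aln in alignments:
        width = len(next(iter(aln.values())))  # all rows of one alignment share a length
        for name in names:
            joined[name] += aln.get(name, "-" * width)
    return joined


# Toy data: two short "gene" alignments for two isolates.
abcZ = {"Isolate 1": "ATG-CA", "Isolate 2": "ATGGCA"}
aroE = {"Isolate 1": "TTAC", "Isolate 2": "TT-C"}
joined = join_alignments([abcZ, aroE])
```

Each isolate ends up with one row in which the genes appear in the order the alignments were passed in.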
• Gaps Calculates the number of alignment positions where one sequence has a gap and the
other does not.
• Identities Calculates the number of identical alignment positions among the overlapping
alignment positions between the two sequences. An overlapping alignment position is a position
where at least one residue is present, rather than only gaps.
• Differences Calculates the number of alignment positions where one sequence is different
from the other. This includes gap differences as in the Gaps comparison.
• Distance Calculates the Jukes-Cantor distance between the two sequences. This number
is given as the Jukes-Cantor correction of the proportion between identical and overlapping
alignment positions between the two sequences.
• Percent identity Calculates the percentage of identical residues among the
overlapping alignment positions between the two sequences.
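The five comparisons can be illustrated with a short Python sketch for two rows of an alignment. It follows the definitions in the list above, but it is only a reading of them and not the tool's code: the function name compare_pair is hypothetical, and the use of the nucleotide form of the Jukes-Cantor correction, applied to the proportion of non-identical overlapping positions, is an assumption.

```python
import math


def compare_pair(a, b, gap="-"):
    """Return the five comparison values for two rows of one alignment."""
    if len(a) != len(b):
        raise ValueError("rows of an alignment must have equal length")
    gaps = identities = differences = overlapping = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # only gaps here, so not an overlapping position
        overlapping += 1
        if x == gap or y == gap:
            gaps += 1  # exactly one of the two rows has a gap
        if x == y:
            identities += 1
        else:
            differences += 1  # includes the gap differences counted above
    if overlapping == 0:
        raise ValueError("the two rows share no overlapping positions")
    p = differences / overlapping
    # Nucleotide Jukes-Cantor correction (assumed form); undefined for p >= 0.75.
    distance = -0.75 * math.log(1 - 4 * p / 3) if p < 0.75 else math.inf
    return {
        "gaps": gaps,
        "identities": identities,
        "differences": differences,
        "distance": distance,
        "percent_identity": 100 * identities / overlapping,
    }


result = compare_pair("ACGT-A-C", "ACCTTA-G")
```

In this toy pair the seventh column contains only gaps, so it is skipped and seven overlapping positions remain.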
values that appear when you slide the cursor reflect the percentage of the range of values in
the table, and not absolute values.
The following settings are present in the side panel:
• Contents
Upper comparison Selects the comparison to show in the upper triangle of the table.
Upper comparison gradient Selects the color gradient to use for the upper triangle.
Lower comparison Selects the comparison to show in the lower triangle. Choose the
same comparison as in the upper triangle to show all the results of an asymmetric
comparison.
Lower comparison gradient Selects the color gradient to use for the lower triangle.
Diagonal from upper Use this setting to show the diagonal results from the upper
comparison.
Diagonal from lower Use this setting to show the diagonal results from the lower
comparison.
No Diagonal Leaves the diagonal table entries blank.
• Layout
Lock headers Locks the sequence labels and table headers when scrolling the table.
Sequence label Changes the sequence labels.
• Text format
Text size Changes the size of the table and the text within it.
Font Changes the font in the table.
Bold Toggles the use of boldface in the table.
• Annotation of functional domains, which may only be known for a subset of the sequences,
can be transferred to aligned positions in other un-annotated sequences.
• Conserved regions in the alignment can be found which are prime candidates for holding
functionally important sites.
Figure 16.20: The tabular format of a multiple alignment of 24 Hemoglobin protein sequences.
Sequence names appear at the beginning of each row and the residue position is indicated by
the numbers at the top of the alignment columns. The level of sequence conservation is shown
on a color scale with blue residues being the least conserved and red residues being the most
conserved.
Whereas the optimal solution to the pairwise alignment problem can be found in reasonable
time, the problem of constructing a multiple alignment is much harder.
The first major challenge in the multiple alignment procedure is how to rank different alignments,
i.e., which scoring function to use. Since the sequences have a shared history they are correlated
through their phylogeny and the scoring function should ideally take this into account. Doing so
is, however, not straightforward as it increases the number of model parameters considerably.
It is therefore commonplace to either ignore this complication and assume sequences to be
unrelated, or to use heuristic corrections for shared ancestry.
The second challenge is to find the optimal alignment given a scoring function. For pairs of
sequences this can be done by dynamic programming algorithms, but for more than three
sequences this approach demands too much computer time and memory to be feasible.
A commonly used approach is therefore to do progressive alignment [Feng and Doolittle, 1987]
where multiple alignments are built through the successive construction of pairwise alignments.
These algorithms provide a good compromise between time spent and the quality of the resulting
alignment.
The method has the inherent drawback that once two sequences are aligned, there is no way
of changing their relative alignment based on the information that additional sequences may
contribute later in the process. It is therefore important to make the best possible alignments
early in the procedure, to avoid accumulating errors. To accomplish this, a tree of the sequences
is usually constructed to guide the progressive alignment algorithm. To overcome the problem
of a time-consuming tree construction step, we use word matching, a method that groups
sequences very efficiently, saving much time without significantly reducing the accuracy of the
resulting alignment.
Our algorithm (developed by QIAGEN Aarhus) has two speed settings: "standard" and "fast".
The standard method makes a fairly standard progressive alignment using the fast method of
generating a guide tree. When aligning two alignments to each other, two matching columns are
scored as the average of all the pairwise scores of the residues in the columns. The gap cost is
affine, allowing a different cost for the first gapped position and for the consecutive gaps. This
ensures that gaps are not spread out too much.
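The two scoring ideas in this paragraph, an affine gap cost and column scores averaged over all residue pairs, can be sketched as follows. The sketch is illustrative only: the function names, the match/mismatch scores and the gap cost values are hypothetical and are not the parameters used by the alignment algorithm, and columns are assumed to contain residues only.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of `length` positions: the first position is charged
    `gap_open`, each consecutive position `gap_extend`."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def simple_score(x, y):
    """Hypothetical residue score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def column_score(col_a, col_b, score=simple_score):
    """Average of all pairwise scores between the residues of two columns."""
    pairs = [(x, y) for x in col_a for y in col_b]
    return sum(score(x, y) for x, y in pairs) / len(pairs)


# One gap of four positions is cheaper than four separate one-position gaps.
one_long_gap = affine_gap_cost(4, gap_open=10, gap_extend=1)
four_short_gaps = 4 * affine_gap_cost(1, gap_open=10, gap_extend=1)
```

Because opening a gap costs more than extending one, an alignment with a few long gaps scores better than one with many scattered gaps of the same total length.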
The fast method of alignment uses the same overall method, except that it uses fixpoints in
the alignment algorithm based on short subsequences that are identical in the sequences that
are being aligned. This allows similar sequences to be aligned much more efficiently, without
reducing accuracy very much.
Chapter 17
Phylogenetic trees
Contents
17.1 K-mer Based Tree Construction
17.2 Create tree
17.3 Model Testing
17.4 Maximum Likelihood Phylogeny
17.4.1 Bioinformatics explained
17.5 Tree Settings
17.5.1 Minimap
17.5.2 Tree layout
17.5.3 Node settings
17.5.4 Label settings
17.5.5 Background settings
17.5.6 Branch layout
17.5.7 Bootstrap settings
17.5.8 Visualizing metadata
17.5.9 Node right click menu
17.6 Metadata and phylogenetic trees
17.6.1 Table Settings and Filtering
17.6.2 Add or modify metadata on a tree
17.6.3 Undefined metadata values on a tree
17.6.4 Selection of specific nodes
CHAPTER 17. PHYLOGENETIC TREES 349
The viewer for visualizing and working with phylogenetic trees allows the user to create high-quality,
publication-ready figures of phylogenetic trees. Large trees can be explored in two alternative tree
layouts: circular and radial. The viewer supports importing, editing and visualization of metadata
associated with nodes in phylogenetic trees.
Below is an overview of the main features of the phylogenetic tree editor. Further details can be
found in the subsequent sections.
Main features of the phylogenetic tree editor:
• Visualization of metadata through e.g. node color, node shape, branch color, etc.
• Minimap navigation.
• Curved edges.
For a given set of aligned sequences (see section 16.1) it is possible to infer their evolutionary
relationships. In CLC Main Workbench this may be done either by using a distance based
method or by using maximum likelihood (ML) estimation, which is a statistical approach (see
Bioinformatics explained in section 17.4.1). Both approaches generate a phylogenetic tree.
Three tools are available for generating phylogenetic trees:
• K-mer Based Tree Construction ( ) Is a distance-based method that can create trees
based on multiple single sequences. K-mers are used to compute distance matrices for
distance-based phylogenetic reconstruction tools such as neighbor joining and UPGMA (see
section 17.4.1). This method is less precise than the Create Tree tool but it can cope
with a very large number of long sequences as it does not require a multiple alignment.
The k-mer based tree construction tool is especially useful for whole genome phylogenetic
reconstruction where the genomes are closely related, i.e. they differ mainly by SNPs and
contain no or few structural variations.
• Maximum Likelihood Phylogeny ( ) The most advanced and time consuming method of
the three mentioned. The maximum likelihood tree estimation is performed under the
assumption of one of five substitution models: the Jukes-Cantor, the Felsenstein 81, the
Kimura 80, the HKY and the GTR (also known as the REV model) models (see section 17.4
for further information about the models). Prior to using the Maximum Likelihood Phylogeny
tool for creating a phylogenetic tree it is recommended to run the Model Testing tool (see
section 17.3) in order to identify the most suitable models for creating a tree.
• Create Tree ( ) Is a tool that uses distance estimates computed from a multiple
sequence alignment to create a tree. The user can select whether to use Jukes-Cantor
distance correction or Kimura distance correction (Kimura 80 for nucleotides/Kimura
protein for proteins) in combination with either the neighbor joining or UPGMA method (see
section 17.4.1).
Figure 17.1: Select sequences needed for creating a tree with K-mer based tree construction.
Next, select the construction method, specify the k-mer length and select a distance measure
for tree construction (figure 17.2):
• Tree construction
Tree construction method The user is asked to specify which distance-based method
to use for tree construction. There are two options (see section 17.4.1):
∗ The UPGMA method. Assumes constant rate of evolution.
∗ The Neighbor Joining method. Well suited for trees with varying rates of evolution.
• K-mer settings
Figure 17.2: Select the construction method, and specify the k-mer length and a distance measure.
K-mer length (the value k) Allows specification of the k-mer length, which can be a
number between 3 and 50.
Distance measure The distance measure is used to compute the distances between
two counts of k-mers. Three options exist: Euclidian squared, Mahalanobis, and
Fractional common K-mer count. See section 17.4.1 for further details.
If an alignment was selected before the tool was launched, that alignment will be listed in the
Selected Elements window of the dialog. Use the arrows to add or remove elements from the
Navigation Area. Click Next to adjust parameters.
Figure 17.4 shows the parameters that can be set for this distance-based tree creation:
• Bootstrapping.
Perform bootstrap analysis. To evaluate the reliability of the inferred trees, CLC Main
Workbench allows the option of doing a bootstrap analysis (see section 17.4.1). A
bootstrap value will be attached to each node, and this value is a measure of the
confidence in the subtree rooted at the node. The number of replicates used in the
bootstrap analysis can be adjusted in the wizard. The default value is 100 replicates
which is usually enough to distinguish between reliable and unreliable nodes in the
tree. The bootstrap value assigned to each inner node in the output tree is the
percentage (0-100) of replicates which contained the same subtree as the one rooted
at the inner node.
To do model testing:
Tools | Alignments and Trees ( )| Model Testing ( )
Select the alignment that you wish to use for the tree construction (figure 17.5):
A base tree (a guiding tree) is required in order to be able to determine which model(s)
would be the most appropriate to use to make the best possible phylogenetic tree from a
specific alignment. The topology of the base tree is used in the hierarchical likelihood ratio
test (hLRT), and the base tree is used as starting point for topology exploration in Bayesian
information criterion (BIC), Akaike information criterion (or minimum theoretical information
criterion) (AIC), and AICc (AIC with a correction for the sample size) ranking.
Construction method A base tree is created automatically using one of two methods
from the Create Tree tool:
∗ The UPGMA method. Assumes constant rate of evolution.
∗ The Neighbor Joining method. Well suited for trees with varying rates of evolution.
• Hierarchical likelihood ratio test (hLRT) parameters A statistical test of the goodness-of-fit
between two models that compares a relatively more complex model to a simpler model to
see if it fits a particular dataset significantly better.
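The ranking criteria and the likelihood ratio test named above have standard textbook definitions, sketched below in Python. This is not a description of how the Model Testing tool computes its report: the log-likelihoods, the parameter counts and the use of the alignment length as sample size are hypothetical values chosen for illustration.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for a model with k free parameters."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction; n is the sample size."""
    return aic(log_likelihood, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(log_likelihood, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * log_likelihood


def lrt_statistic(lnl_simple, lnl_complex):
    """Likelihood ratio test statistic for two nested models."""
    return 2 * (lnl_complex - lnl_simple)


# Hypothetical fits: the complex model has one more free parameter.
n_sites = 1000
simple = {"lnl": -2500.0, "k": 9}
complex_ = {"lnl": -2490.0, "k": 10}

statistic = lrt_statistic(simple["lnl"], complex_["lnl"])
# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
complex_fits_better = statistic > 3.841
```

For all three information criteria a lower value indicates a preferred model. For the likelihood ratio test, the statistic is compared with a chi-square distribution whose degrees of freedom equal the difference in the number of free parameters.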
The output from model testing is a report that lists all test results in table format. For each
tested model the report indicates whether it is recommended to use rate variation or not. Topology
variation is recommended in all cases.
From the listed test results, it is up to the user to select the most appropriate model. The
different statistical tests will usually agree on which models to recommend although variations
may occur. Hence, in order to select the best possible model, it is recommended to select the
model that has proven to be the best by most tests.
• Start tree
Construction method Specify the tree construction method which should be used to
create the initial tree, Neighbor Joining or UPGMA.
Existing start tree Alternatively, an existing tree can be used as starting tree for the
tree reconstruction. Click on the folder icon to the right of the text field to specify the
desired starting tree.
Nucleotide substitution model CLC Main Workbench allows maximum likelihood tree
estimation to be performed under the assumption of one of five nucleotide substitution
models:
∗ Jukes-Cantor [Jukes and Cantor, 1969]
• Rate variation
To enable variable substitution rates among individual nucleotide sites in the alignment,
select the include rate variation box. When selected, the discrete gamma model of
Yang [Yang, 1994b] is used to model rate variation among sites. The number of categories
used in the discretization of the gamma distribution as well as the gamma distribution
parameter may be adjusted by the user (as the gamma distribution is restricted to have
mean 1, there is only one parameter in the distribution).
• Estimation
Estimation is done according to the maximum likelihood principle, that is, a search is
performed for the values of the free parameters in the model assumed that results in the
highest likelihood of the observed alignment [Felsenstein, 1981]. By ticking the Estimate
substitution rate parameters box, maximum likelihood values of the free parameters in the
rate matrix describing the assumed substitution model are found. If the Estimate topology
box is selected, a search in the space of tree topologies for that which best explains the
alignment is performed. If left un-ticked, the starting topology is kept fixed at that of the
starting tree.
The Estimate Gamma distribution parameter is active if rate variation has been included
in the model and in this case allows estimation of the Gamma distribution parameter
to be switched on or off. If the box is left un-ticked, the value is fixed at that given
in the Rate variation part. In the absence of rate variation, estimation of substitution
parameters and branch lengths is carried out according to the expectation maximization
algorithm [Dempster et al., 1977]. With rate variation the maximization algorithm is
performed. The topology space is searched according to the PHYML method [Guindon and
Gascuel, 2003], allowing efficient search and estimation of large phylogenies. Branch
lengths are given in terms of expected numbers of substitutions per nucleotide site.
In the next step of the wizard it is possible to perform bootstrapping (figure 17.9).
To evaluate the reliability of the inferred trees, CLC Main Workbench allows the option of doing a
bootstrap analysis (see section 17.4.1). A bootstrap value will be attached to each node, and
this value is a measure of the confidence in the subtree rooted at the node. The number of
replicates in the bootstrap analysis can be adjusted in the wizard by specifying the number of
times to resample the data. The default value is 100 resamples. The bootstrap value assigned
to a node in the output tree is the percentage (0-100) of the bootstrap resamples which resulted
in a tree containing the same subtree as that rooted at the node.
Figure 17.10: A proposed phylogeny of the great apes (Hominidae). Different components of the
tree are marked, see text for description.
The ordering of the nodes determines the tree topology and describes how lineages have diverged
over the course of evolution. The branches of the tree represent the amount of evolutionary
divergence between two nodes in the tree and can be based on different measurements. A tree
is completely specified by its topology and the set of all edge lengths.
The phylogenetic tree in figure 17.10 is rooted at the most recent common ancestor of all
Hominidae species, and therefore represents a hypothesis of the direction of evolution, e.g. that
the common ancestor of gorilla, chimpanzee and man existed before the common ancestor
of chimpanzee and man. In contrast, an unrooted tree would represent relationships without
assumptions about ancestry.
Besides evolutionary biology and systematics the inference of phylogenies is central to other
areas of research.
As more and more genetic diversity is being revealed through the completion of multiple
genomes, an active area of research within bioinformatics is the development of comparative
machine learning algorithms that can simultaneously process data from multiple species [Siepel
and Haussler, 2004]. Through the comparative approach, valuable evolutionary information can
be obtained about which amino acid substitutions are functionally tolerated by the organism and
which are not. This information can be used to identify substitutions that affect protein function
and stability, and is of major importance to the study of proteins [Knudsen and Miyamoto,
2001]. Knowledge of the underlying phylogeny is, however, paramount to comparative methods
of inference as the phylogeny describes the underlying correlation from shared history that exists
between data from different species.
In molecular epidemiology of infectious diseases, phylogenetic inference is also an important
tool. The very fast substitution rate of microorganisms, especially the RNA viruses, means that
these show substantial genetic divergence over the time-scale of months and years. Therefore,
the phylogenetic relationship between the pathogens from individuals in an epidemic can be
resolved and contribute valuable epidemiological information about transmission chains and
epidemiologically significant events [Leitner and Albert, 1999], [Forsberg et al., 2001].
Common to all these models is that they assume mutations at different sites in the genome
occur independently and that the mutations at each site follow the same common probability
distribution. Thus all five models provide relative frequencies for each of the 16 possible DNA
substitutions (e.g. C → A, C → C, C → G,...).
The Jukes-Cantor and Kimura 80 models assume equal base frequencies and the HKY and GTR
models allow the frequencies of the four bases to differ (they will be estimated by the observed
frequencies of the bases in the alignment). In the Jukes-Cantor model all substitutions are
assumed to occur at equal rates, in the Kimura 80 and HKY models transition and transversion
rates are allowed to differ (substitution between two purines (A ↔ G) or two pyrimidines (C ↔ T )
are transitions and purine - pyrimidine substitutions are transversions). The GTR model is the
general time reversible model that allows all substitutions to occur at different rates. For the
substitution rate matrices describing the substitution models we use the parametrization of
Yang [Yang, 1994a].
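The distinction between transitions and transversions, and how the Jukes-Cantor and Kimura 80 models treat them, can be sketched as follows. This is a simplified illustration and not the parametrization of Yang used by the tool: the function names and the parameter kappa (a hypothetical transition/transversion rate ratio) are assumptions, and the HKY and GTR models are left out because they also require base frequencies.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a nucleotide substitution x -> y."""
    if x == y:
        return "identity"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def relative_rate(model, x, y, kappa=2.0):
    """Relative rate of x -> y under a simplified JC or K80 model.

    `kappa` is a hypothetical transition/transversion rate ratio.
    """
    kind = substitution_type(x, y)
    if kind == "identity":
        return 0.0
    if model == "JC":
        return 1.0  # all substitutions occur at the same rate
    if model == "K80":
        return kappa if kind == "transition" else 1.0
    raise ValueError("only JC and K80 are covered by this sketch")
```

With kappa set to 1, the Kimura 80 rates in this sketch reduce to the Jukes-Cantor rates.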
For protein sequences, our Maximum Likelihood Phylogeny tool supports four substitution models:
As with nucleotide substitution models, it is assumed that mutations at different sites in the
genome occur independently and according to the same probability distribution.
The Bishop-Friday model assumes all amino acids occur with same frequency and that all
substitutions are equally likely. This is the simplest model, but also the most unrealistic. The
remaining three models use amino acid frequencies and substitution rates which have been
determined from large scale experiments where huge sets of protein sequences have been
aligned and rates have been estimated. These three models reflect the outcome of three
different experiments. We recommend using WAG as these rates were estimated from the
largest experiment.
the k-mers should have a length (k) that is somewhat below the average distance between
mismatches if the input sequences were aligned (in the extreme case of k=the length of the
sequences, two organisms have a maximum distance if they are not identical). Thus the selected
k value should not be too large and not too small. A general rule of thumb is to only use k-mer
based distance estimation for organisms that are not too distantly related.
Formal definition of distance. In the following, we give a more formal definition of the three
supported distance measures: Euclidian-squared, Mahalanobis and Fractional common k-mer
count. For all three, we first associate a point p(s) to every input sequence s. Each point p(s) has
one coordinate for every possible length k sequence (e.g. if s represents a nucleotide sequence,
then p(s) has 4^k coordinates). The coordinate corresponding to a length k sequence x has the
value: "number of times x occurs as a subsequence in s". Now for two sequences s1 and s2 ,
their evolutionary distance is defined as follows:
• Euclidian squared: For this measure, the distance is simply defined as the (squared
Euclidian) distance between the two points p(s1) and p(s2), i.e.
\[ \mathrm{dist}(s_1, s_2) = \sum_i \bigl(p(s_1)_i - p(s_2)_i\bigr)^2 . \]
Here the standard deviations can be computed directly from a set of equilibrium frequencies
for the different bases, see [Gentleman and Mullin, 1989].
• Fractional common k-mer count: For the last measure, the distance is computed based
on the minimum count of every k-mer in the two sequences, thus if two sequences are very
different, the minimums will all be small. The formula is as follows:
\[ \mathrm{dist}(s_1, s_2) = \log\Bigl(0.1 + \sum_i \frac{\min(p(s_1)_i, p(s_2)_i)}{\min(n, m) - k + 1}\Bigr). \]
Here n is the length of s1 and m is the length of s2. This method has been described
in [Edgar, 2004].
In experiments performed in [Höhl et al., 2007], the Mahalanobis distance measure seemed to
be the best performing of the three supported measures.
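The k-mer count vectors and two of the three distance measures can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, the sum in the fractional common k-mer formula is read as lying inside the logarithm, and the Mahalanobis measure is left out because it additionally requires the standard deviations of the k-mer counts.

```python
import math
from collections import Counter


def kmer_counts(s, k):
    """Count how often each length-k subsequence occurs in s.

    Only k-mers that actually occur are stored; all other coordinates of
    the point p(s) are implicitly zero.
    """
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def euclidean_squared(s1, s2, k):
    """Squared Euclidean distance between the two k-mer count vectors."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum((c1[x] - c2[x]) ** 2 for x in set(c1) | set(c2))


def fractional_common_kmer(s1, s2, k):
    """log(0.1 + shared k-mers / number of k-mers in the shorter sequence)."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    shared = sum(min(c1[x], c2[x]) for x in set(c1) & set(c2))
    return math.log(0.1 + shared / (min(len(s1), len(s2)) - k + 1))


s1 = "ACGTACGT"
s2 = "ACGTTCGT"
d_euclid = euclidean_squared(s1, s2, 3)      # 6
d_common = fractional_common_kmer(s1, s2, 3)
```

K-mers that are absent from both sequences contribute nothing to either sum, so it is enough to iterate over the k-mers that were observed rather than over all 4^k coordinates.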
underestimate of the real distance as multiple mutations could have occurred at any position. To
correct for these hidden substitutions a substitution model, such as Jukes-Cantor or Kimura 80,
can be used to get a more precise distance estimate (see section 17.4.1).
Alternatively, k-mer based methods or SNP based methods can be used to get a distance
estimate without the use of substitution models.
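The two corrections mentioned above have well-known closed forms for nucleotide data, sketched below. The function names are hypothetical, and how the observed proportions are obtained from an alignment (for example how gaps are treated) is left open, since that depends on the tool.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor correction of the observed proportion p of differing sites.

    math.log raises ValueError when p >= 0.75, where the correction is undefined.
    """
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura80_distance(transitions, transversions):
    """Kimura 80 correction from the observed proportions of sites that
    differ by a transition and by a transversion."""
    p, q = transitions, transversions
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


observed = 0.10
jc = jukes_cantor_distance(observed)
k80 = kimura80_distance(0.06, 0.04)
```

Both corrected distances are larger than the observed proportion of 0.10, which reflects the hidden substitutions. When one third of the observed differences are transitions, the Kimura 80 distance coincides with the Jukes-Cantor distance.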
After distance estimates have been computed, a phylogenetic tree can be reconstructed using
a distance based reconstruction method. Most distance based methods perform a bottom up
reconstruction using a greedy clustering algorithm. Initially, each input organism is put in its
own cluster which corresponds to a leaf node in the resulting tree. Next, pairs of clusters are
iteratively joined into higher level clusters, which correspond to connecting two nodes in the tree
with a new parent node. When a single node remains, the tree is reconstructed.
The CLC Main Workbench provides two of the most widely used distance based reconstruction
methods:
• The UPGMA method [Michener and Sokal, 1957] which assumes a constant rate of
evolution (molecular clock hypothesis) in the different lineages. This method reconstructs
trees by iteratively joining the two nearest clusters until there is only one cluster left. The
result of the UPGMA method is a rooted bifurcating tree annotated with branch lengths.
• The Neighbor Joining method [Saitou and Nei, 1987] attempts to reconstruct a minimum
evolution tree (a tree where the sum of all branch lengths is minimized). Opposite to the
UPGMA method, the neighbor joining method is well suited for trees with varying rates of
evolution in different lineages. A tree is reconstructed by iteratively joining clusters which
are close to each other but at the same time far from all other clusters. The resulting tree
is a bifurcating tree with branch lengths. Since no particular biological hypothesis is made
about the placement of the root in this method, the resulting tree is unrooted.
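The bottom-up clustering described for UPGMA can be sketched in a few lines of Python. This is a textbook-style illustration under simple assumptions (distances supplied as a dictionary keyed by pairs of names, ties broken by iteration order); the function name and data layout are hypothetical and do not describe the Workbench's implementation.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a rooted tree as nested tuples (left, right, height) using UPGMA.

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b
    """
    # Each cluster id maps to (subtree, number of leaves in the cluster).
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two nearest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2.0
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to the new cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b, height), size_a + size_b)
    tree, _ = next(iter(clusters.values()))
    return tree

# Hypothetical distance matrix for four organisms.
pairwise = {
    frozenset(("A", "B")): 2.0, frozenset(("A", "C")): 6.0,
    frozenset(("A", "D")): 10.0, frozenset(("B", "C")): 6.0,
    frozenset(("B", "D")): 10.0, frozenset(("C", "D")): 10.0,
}
tree = upgma(["A", "B", "C", "D"], pairwise)
print(tree)
```

Neighbor Joining follows the same iterative pattern but selects the pair to join using a criterion that also accounts for the distance to all other clusters; that criterion is not shown here.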
Bootstrap tests
Bootstrap tests [Felsenstein, 1985] are one of the most common ways to evaluate the reliability
of the topology of a phylogenetic tree. In a bootstrap test, trees are evaluated using Efron's
resampling technique [Efron, 1982], which samples nucleotides from the original set of sequences
as follows:
Given an alignment of n sequences (rows) of length l (columns), we randomly choose l columns
in the alignment with replacement and use them to create a new alignment. The new alignment
has n rows and l columns just like the original alignment but it may contain duplicate columns
and some columns in the original alignment may not be included in the new alignment. From
this new alignment we reconstruct the corresponding tree and compare it to the original tree.
For each subtree in the original tree we search for the same subtree in the new tree and add a
score of one to the node at the root of the subtree if the subtree is present in the new tree. This
procedure is repeated a number of times (usually around 100 times). The result is a counter for
each interior node of the original tree, which indicates how likely it is to observe the exact same
subtree when the input sequences are resampled. A bootstrap value is then computed for each
interior node as the percentage of resampled trees that contained the same subtree as that
rooted at the node.
Bootstrap values can be seen as a measure of how reliably we can reconstruct a tree, given
the sequence data available. If all trees reconstructed from resampled sequence data have very
different topologies, then most bootstrap values will be low, which is a strong indication that the
topology of the original tree cannot be trusted.
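The resampling procedure can be sketched as follows in Python. The tree reconstruction step is passed in as a function; here a deliberately simple stand-in (`closest_pair_clade`, a hypothetical helper that reports only the clade joining the two most similar rows) is used so that the example stays short. All names are illustrative assumptions, not part of the Workbench.

```python
import random
from itertools import combinations

def resample_columns(alignment, rng):
    """Choose l columns with replacement and build a new alignment of the same size."""
    length = len(alignment[0])
    picked = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in picked) for row in alignment]

def closest_pair_clade(alignment):
    """Toy stand-in for tree reconstruction: the clade of the two most similar rows."""
    def mismatches(pair):
        i, j = pair
        return sum(a != b for a, b in zip(alignment[i], alignment[j]))
    best = min(combinations(range(len(alignment)), 2), key=mismatches)
    return {frozenset(best)}

def bootstrap_support(alignment, clades_of, replicates=100, seed=0):
    """Percentage of resampled alignments whose tree contains each original clade."""
    rng = random.Random(seed)
    original = clades_of(alignment)
    counts = {clade: 0 for clade in original}
    for _ in range(replicates):
        found = clades_of(resample_columns(alignment, rng))
        for clade in original:
            if clade in found:
                counts[clade] += 1
    return {clade: 100.0 * n / replicates for clade, n in counts.items()}

# Hypothetical alignment: rows 0 and 1 are nearly identical, row 2 is distant.
alignment = ["AAAAAAAA", "AAAAAAAT", "CCCCGGGG"]
support = bootstrap_support(alignment, closest_pair_clade)
print(support)
```

In a real analysis, `clades_of` would run the chosen reconstruction method and return every subtree of the resulting tree as a set of leaves.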
Scale bar
The scale bar unit depends on the distance measure used and the tree construction algorithm
used. The trees produced using the Maximum Likelihood Phylogeny tool have a very specific
interpretation: a distance of x means that the expected number of substitutions/changes per
nucleotide (amino acid for protein sequences) is x. That is, if the distance between two taxa is 0.01,
you expect a change in each nucleotide independently with probability 1 %. For the remaining
algorithms, there is not as nice an interpretation. The distance depends on the weight given to
different mutations as specified by the distance measure.
17.5.1 Minimap
The Minimap is a navigation tool that shows a small version of the tree. A grey square indicates
the specific part of the tree that is visible in the View Area (figure 17.12). To navigate the tree
using the Minimap, click on the Minimap with the mouse and move the grey square around within
the Minimap.
Figure 17.12: Visualization of a phylogenetic tree. The grey square in the Minimap shows the part
of the tree that is shown in the View Area.
Figure 17.13: The tree layout can be adjusted in the Side Panel. The top part of the figure shows a
tree with increasing node order. In the bottom part of the figure the tree has been reverted to the
original tree topology.
• Layout Selects one of the five layout types: Phylogram, Cladogram, Circular Phylogram,
Circular Cladogram or Radial. Note that only the Cladogram layouts are available if all
branches in the tree have zero length.
Phylogram is a rooted tree where the edges have "lengths", usually proportional to
the inferred amount of evolutionary change to have occurred along each branch.
Cladogram is a rooted tree without branch lengths which is useful for visualizing the
topology of trees.
Circular Phylogram is also a phylogram but with the leaves in a circular layout.
Circular Cladogram is also a cladogram but with the leaves in a circular layout.
Radial is an unrooted tree that has the same topology and branch lengths as the
rooted styles, but lacks any indication of evolutionary direction.
• Ordering The nodes can be ordered by branch length; either Increasing (shown in
figure 17.13) or Decreasing.
• Reset Tree Topology Resets to the default tree topology and node order (see figure 17.13).
Any previously collapsed nodes will be uncollapsed.
CHAPTER 17. PHYLOGENETIC TREES 365
• Fixed width on zoom Locks the horizontal size of the tree to the size of the main window.
Zoom is therefore only performed on the vertical axis when this option is enabled.
• Show as unrooted tree The tree can be shown with or without a root.
• Leaf node symbol Leaf nodes can be shown as a range of different symbols (Dot, Box,
Circle, etc.).
• Internal node symbols The internal nodes can also be shown with a range of different
symbols (Dot, Box, Circle, etc.).
• Max. symbol size The size of leaf- and internal node symbols can be adjusted.
• Avoid overlapping symbols The symbol size will be automatically limited to avoid overlaps
between symbols in the current view.
• Node color Specify a fixed color for all nodes in the tree.
The node layout settings in the Side Panel are shown in figure 17.14.
Figure 17.14: The Node Layout settings. Node color is specified by metadata and is therefore
inactive in this example.
• Hide overlapping labels When enabled, overlapping labels are hidden automatically. Disable
this option to display all labels even if they overlap.
• Show internal node labels Labels for internal nodes of the tree (if any) can be displayed.
Please note that subtrees and nodes can be labeled with a custom text. This is done by
right clicking the node and selecting Edit Label (see figure 17.15).
• Show leaf node labels Leaf node labels can be shown or hidden.
• Rotate Subtree labels Subtree labels can be shown horizontally or vertically. Labels are
shown vertically when "Rotate subtree labels" has been selected. Subtree labels can
be added with the right click option "Set Subtree Label" that is enabled from "Decorate
subtree" (see section 17.5.9).
• Align labels Align labels to the node furthest from the center of the tree so that all labels
are positioned next to each other. The exact behavior depends on the selected tree layout.
• Connect labels to nodes Adds a thin line from the leaf node to the aligned label. Only
possible when Align labels option is selected.
Figure 17.15: "Edit label" in the right click menu can be used to customize the label text. The way
node labels are displayed can be controlled through the labels settings in the right side panel.
When working with big trees there is typically not enough space to show all labels. As illustrated
in figure 17.15, only some of the labels are shown. The hidden labels are illustrated with thin
horizontal lines (figure 17.16).
There are different ways of showing more labels. One way is to reduce the font size of the labels,
which can be done under Label font settings in the Side Panel. Another option is to zoom in
on specific areas of the tree (figure 17.16 and figure 17.17). The last option is to disable Hide
overlapping labels under "Label settings" in the right side panel. When this option is unchecked
all labels are shown even if the text overlaps. When allowing overlapping labels it is usually a
good idea to disable Show label background under "Background settings" (see section 17.5.5).
Note! When working with a tree with hidden labels, it is possible to make the hidden label text
appear by moving the mouse over the node with the hidden label.
Note! The text within labels can be edited by editing the metadata table values directly.
Figure 17.16: The zoom function in the upper right corner of the Workbench can be used to zoom
in on a particular region of the tree. When the zoom function has been activated, use the mouse
to drag a rectangle over the area that you wish to zoom in on.
Figure 17.17: After zooming in on a region of interest more labels become visible. In this example
all labels are now visible.
• Curvature Adjust the degree of branch curvature to get branches with round corners.
• Min. length Select a minimum branch length. This option can be used to prevent nodes
connected with a short branch from clustering at the parent node.
The branch layout settings in the Side Panel are shown in figure 17.18.
• Bootstrap value font settings Specify/adjust font type, size and typography (Bold, Italic or
normal).
• Show bootstrap values (%) Show or hide bootstrap values. When selected, the bootstrap
values (in percent) will be displayed on internal nodes if these have been computed during
the reconstruction of the tree.
• Bootstrap threshold (%) When specifying a bootstrap threshold, the branch lengths can
be controlled manually by collapsing internal nodes with bootstrap values under a certain
threshold.
• Highlight bootstrap ≥ (%) Highlights branches where the bootstrap value is at or above the
user defined threshold.
• Node symbol size Change the node symbol size to visualize metadata.
• Label text The metadata can be shown directly as text labels as shown in figure 17.19.
• Label text color The label text can be colored and used to visualize metadata (see
figure 17.19).
• Label background color The background color of node text labels can be used to visualize
metadata.
Please note that when visualizing metadata through a tree property that can be adjusted in the
right side panel (such as node color or node size), an exclamation mark will appear next to the
control for that property to indicate that the setting is inactive because it is defined by metadata
(see figure 17.14).
• Set Root At This Node Re-root the tree using the selected node as root. Please note that
re-rooting will change the tree topology. This option is only available for internal nodes, not
leaf nodes.
• Set Root Above Node Re-root the tree by inserting a node between the selected node and
its parent. Useful for rooting trees using an outgroup.
• Collapse Branches associated with a selected node can be collapsed with or without the
associated labels. Collapsed branches can be uncollapsed using the Uncollapse option in
the same menu.
Figure 17.19: Different types of metadata can be visualized by adjusting node size, shape, and
color. Two color-coded metadata layers (Year and Host) are shown in the right side of the tree.
• Hide Can be used to hide a node or a subtree. Hidden nodes or subtrees can be shown
again using the Show Hidden Subtree function on a node which is root in a subtree
containing hidden nodes (see figure 17.20). When hiding nodes, a new button appears
labeled "Show X hidden nodes" in the Side Panel under "Tree Layout" (figure 17.21). When
pressing this button, all hidden nodes are shown again.
• Decorate Subtree A subtree can be labeled with a customized name, and the subtree lines
and/or background can be colored. To save the decoration, see figure 17.11 and use
option: Save/Restore Settings | Save Tree View Settings On This Tree View only.
• Extract Sequence List Sequences associated with selected leaf nodes are extracted to a
new sequence list.
• Align Sequences Sequences associated with selected leaf nodes are extracted and used
as input to the Create Alignment tool.
• Assign Metadata Metadata can be added, deleted or modified. To add new metadata
categories a new "Name" must be assigned. (This will be the column header in the
metadata table). To add a new metadata category, enter a value in the "Value" field. To
delete values, highlight the relevant nodes and right click on the selected nodes. In the
dialog that appears, use the drop-down list to select the name of the desired metadata
category and leave the value field empty. When pressing "Add" the values for the selected
metadata category will be deleted from the selected nodes. Metadata can be modified
in the same way, but instead of leaving the value field empty, the new value should be
entered.
Figure 17.20: A subtree can be hidden by selecting "Hide Subtree" and is shown again when
selecting "Show Hidden Subtree" on a parent node.
Figure 17.21: When hiding nodes, a new button labeled "Show X hidden nodes" appears in the
Side Panel under "Tree Layout". When pressing this button, all hidden nodes are brought back.
• Edit label Edit the text in the selected node label. Labels can be shown or hidden by using the
Side Panel: Label settings | Show internal node labels
• Branch length The length of the branch, which connects a node to the parent node.
• Size The length of the sequence which corresponds to each leaf node. This only applies to
leaf nodes.
• Start of sequence The first 50bp of the sequence corresponding to each leaf node.
To view metadata associated with a phylogenetic tree, click on the table icon at the bottom
of the tree. If you hold down the Ctrl key (⌘ on Mac) while clicking on the table icon, you
will be able to see both the tree and the table in a split view (figure 17.22).
Figure 17.22: Tabular metadata that is associated with an existing tree shown in a split view.
Note that Unknown written in italics (black branches) refers to missing metadata, while Unknown in
regular font refers to metadata labeled as "Unknown".
Additional metadata can be associated with a tree by clicking the Import Metadata button. This
will open up the dialog shown in figure 17.23.
To associate metadata with an existing tree a common denominator is required. This is achieved
by mapping the node names in the "Name" column of the metadata table to the names that
have been used in the metadata table to be imported. In this example the "Strain" column holds
the names of the nodes and this column must be assigned "Name" to allow the importer to
associate metadata with nodes in the tree.
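The idea of a common denominator can be illustrated with a small Python sketch that matches tree node names against one column of a metadata table. The comma-separated layout, the function name and the example values are assumptions made for illustration; they do not describe the formats or logic of the Workbench importer.

```python
import csv
import io

def associate_metadata(node_names, table_text, key_column):
    """Attach metadata rows to node names using key_column as the shared name."""
    reader = csv.DictReader(io.StringIO(table_text))
    by_name = {
        row[key_column]: {k: v for k, v in row.items() if k != key_column}
        for row in reader
    }
    # Nodes without a matching row get no metadata.
    return {name: by_name.get(name, {}) for name in node_names}

# Hypothetical table in which the "Strain" column holds the node names.
table = "Strain,Host,Year\nA1,Duck,2005\nB2,Chicken,2006\n"
result = associate_metadata(["A1", "B2", "C3"], table, key_column="Strain")
print(result)
```

Node "C3" has no row in the table and therefore ends up without metadata, which corresponds to a missing value in the tree.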
Figure 17.23: Import of metadata for a tree. The second column named "Strain" is chosen as the
common denominator by entering "Name" in the text field of the column. The column labeled "H"
is ignored by not assigning a column heading to this column.
• Column width The column width can be adjusted in two ways: Manually or Automatically.
• Show column Selects which metadata categories are shown in the table layout.
• Assign Metadata The right click option "Assign Metadata" can be used for four purposes.
Figure 17.24: Metadata table. The column width can be adjusted manually or automatically. Under
"Show column" it is possible to select which columns should be shown in the table. Filtering using
specific criteria can be performed.
To add new metadata categories (columns). In this case, a new "Name" must be
assigned, which will be the column header. Adding a new column requires that a value
is entered in the "Value" field. This can be done by right clicking anywhere in the table.
To add values to one or more rows in an existing column. In this case, highlight the
relevant rows and right click on the selected rows. In the dialog that appears, use the
drop-down list to select the name of the desired column and enter a value.
To delete values from an existing column. This is done in the same way as when
adding a new value, with the only exception that the value field should be left empty.
To delete metadata columns. This is done by selecting all rows in the table followed by
a right click anywhere in the table. Select the name of the column to delete from the
drop down menu and leave the value field blank. When pressing "Add", the selected
column will disappear.
• Delete Metadata "column header" This is the simplest way of deleting a metadata
column. Click on one of the rows in the column to delete and select "Delete column
header".
• Edit "column header" To modify an existing metadata point, right click on a cell in the table
and select "Edit column header". To edit multiple entries at once, select multiple rows
in the table, right click a selected cell in the column you want to edit and choose "Edit
column header" (see an example in figure 17.26). This will change values in all selected
rows in the column that was clicked.
Figure 17.26: To modify existing metadata, click on the specific field, select "Edit <column header>"
and provide a new value.
top of the legend (see the entry "Unknown" in figure 17.27). To remove this entry in the legend,
all nodes must have a value assigned in the corresponding metadata category.
Figure 17.27: A legend for a metadata category where one or more values are undefined. Fill your
metadata table with a value of your choice to edit the mention of "Unknown" in the legend. Note
that the "Unknown" that is not in italics is used for data that had a value written as "Unknown" in
the metadata table.
• Selection of a single node Click once on a single node. Additional nodes can be added by
holding down Ctrl (or ⌘ on Mac) and clicking on them (see figure 17.28).
• Selecting all nodes in a subtree Double clicking on an inner node results in the selection of
all nodes in the subtree rooted at the node.
• Selection via the Metadata table Select one or more entries in the table. The corresponding
nodes will now be selected in the tree.
It is possible to extract a subset of the underlying sequence data directly through either the tree
viewer or the metadata table as follows. Select one or more nodes in the tree where at least
one node has a sequence attached. Right click one of the selected nodes and choose Extract
Sequence List. This will generate a new sequence list containing all sequences attached to
the selected nodes. The same functionality is available in the metadata table where sequences
can be extracted from selected rows using the right click menu. Please note that all extracted
sequences are copies and any changes to these sequences will not be reflected in the tree.
When analyzing a phylogenetic tree it is often convenient to have a multiple alignment of
sequences from e.g. a specific clade in the tree. A quick way to generate such an alignment
is to first select one or more nodes in the tree (or the corresponding entries in the metadata
table) and then select Align Sequences in the right click menu. This will extract the sequences
corresponding to the selected elements and use a copy of them as input to the multiple alignment
tool (see section 16.5.2). Next, adjust the relevant options in the multiple alignment wizard that pops
up and click Finish. The multiple alignment will now be generated.
Figure 17.28: Cherry picking nodes in a tree. The selected leaf sequences can be extracted by
right clicking on one of the selected nodes and selecting "Extract Sequence List". It is also possible
to Align Sequences directly by right clicking on the nodes or leaves.
Chapter 18
Contents
18.1 Annotate with GFF/GTF/GVF file
18.2 Extract sequences
18.3 Shuffle sequence
18.4 Dot plots
18.4.1 Create dot plots
18.4.2 View dot plots
18.4.3 Bioinformatics explained: Dot plots
18.4.4 Bioinformatics explained: Scoring matrices
18.5 Local complexity plot
18.6 Sequence statistics
18.6.1 Bioinformatics explained: Protein statistics
18.7 Join Sequences
18.8 Pattern discovery
18.8.1 Pattern discovery search parameters
18.8.2 Pattern search output
18.9 Motif Search
18.9.1 Dynamic motifs
18.9.2 The Motif Search tool
18.9.3 Java regular expressions
18.10 Create motif list
CLC Main Workbench offers different kinds of sequence analyses that apply to both protein and
DNA.
The analyses are described in this chapter.
CHAPTER 18. GENERAL SEQUENCE ANALYSES 378
the names of the sequences to be annotated. If this is not the case, either the names in the
annotation file, or the names of the sequences, must be updated.
Tools are available for renaming sequences or sequences in sequence lists:
See http://gmod.org/wiki/GFF3 for information about the GFF3 format and
https://mblab.wustl.edu/GTF22.html for information on the GTF format.
• A gene annotation is generated for each gene_id. The region annotated extends from the
leftmost to the rightmost positions of all annotations that have the gene_id (gtf-style).
• CDS annotations that have the same transcriptID are joined to one CDS annotation (gtf-
style). Similarly, CDS annotations that have the same parent are joined to one CDS
annotation (gff-style).
• If there is more than one exon annotation with the same transcriptID these are joined to
one mRNA annotation. If there is only one exon annotation with a particular transcriptID,
and no CDS with this transcriptID, a transcript annotation is added instead of the exon
annotation (gtf-style).
• Exon annotations that have the same mRNA as parent are joined to one mRNA annotation.
Similarly, exon annotations that have the same transcript as parent, are joined to one
transcript annotation (gff-style).
Note that genes and transcripts are linked by name only (not by position, ID etc).
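The gtf-style rule for gene annotations (a region spanning from the leftmost to the rightmost position of all annotations sharing a gene_id) can be sketched in Python as follows. This is a simplified illustration that assumes tab-separated GTF lines with a gene_id attribute and a single sequence; the function name is hypothetical and the code is not the Workbench's parser.

```python
import re

def gene_extents(gtf_lines):
    """Map each gene_id to (leftmost start, rightmost end) over all its annotations."""
    extents = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        start, end = int(fields[3]), int(fields[4])
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None:
            continue
        gene = match.group(1)
        lo, hi = extents.get(gene, (start, end))
        extents[gene] = (min(lo, start), max(hi, end))
    return extents

# Hypothetical GTF records.
records = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\tCDS\t150\t400\t.\t+\t0\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t900\t950\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
]
extents = gene_extents(records)
print(extents)
```

A file covering several sequences would need the sequence name as part of the key; that is left out here for brevity.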
1. If one of the following qualifiers is present, it will be used for naming (prioritized):
Figure 18.1: Select a GFF, GTF or GVF file by clicking on the Browse button.
(a) Name
(b) Gene_name
(c) Gene_ID
(d) Locus_tag
(e) ID
2. If none of these are found, the annotation type will be used as name.
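The prioritized naming rule above can be expressed as a short Python sketch. It assumes that qualifiers are available as a dictionary and that qualifier names are matched exactly as written; both are assumptions made for illustration, and the function name is hypothetical.

```python
# Qualifiers in priority order, as listed above.
NAME_QUALIFIERS = ("Name", "Gene_name", "Gene_ID", "Locus_tag", "ID")

def annotation_name(annotation_type, qualifiers):
    """Return the first available naming qualifier, else the annotation type."""
    for key in NAME_QUALIFIERS:
        if key in qualifiers:
            return qualifiers[key]
    return annotation_type

# Hypothetical annotations.
print(annotation_name("gene", {"ID": "gene0001", "Locus_tag": "b0001"}))
print(annotation_name("exon", {"Note": "no naming qualifier"}))
```

In the first call, Locus_tag is used because it ranks above ID; in the second, no listed qualifier is present, so the annotation type is used as the name.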
You can overrule this naming convention by choosing Replace all annotation names with this
qualifier and specifying another qualifier (see figure 18.2).
If you provide a qualifier, it must be written identically to the corresponding qualifier name in the
annotation file.
Transcript annotations are handled separately, since they inherit the name from the gene
annotation.
Figure 18.2: You can choose Replace all annotation names with the specified qualifier.
Type handling
You can overrule feature types in the annotation file by choosing Replace all annotation types
with and specifying a type to use.
Ignore duplicate annotation
When the Ignore duplicate annotation option is checked, only one instance of duplicate
annotations will be added to the sequence.
Create log
In the Result handling section of the wizard, check the Create log box to create a log that
includes information like the number of annotations found and whether there are any that could not
be placed on the sequence. This information can help with troubleshooting when annotations are
not added to a sequence when they were expected to be.
• Alignments
• BLAST result For BLAST results, the sequence hits are extracted but not the original
query sequence or the consensus sequence.
• Contigs and read mappings For mappings, only the read sequences are extracted.
Reference and consensus sequences are not extracted using this tool.
• Sequence lists See further notes below about running this tool on sequence lists.
If only a subset of the sequences is of interest, create an element containing just this subset
first, and then run Extract Sequences on this. See the documentation for the relevant element
types for further details. For example, for extracting a subset of a mapping, see section 21.7.6.
Paired reads are extracted in accordance with the read group settings, which are specified during
the original import of the reads. If the orientation has since been changed (for example using the
Element Info tab for the sequence list), the read group information will be modified and reads
will be extracted as specified by the modified read group. The default read group orientation is
forward-reverse.
Extracting sequences from sequence lists: As all sequences will be extracted, the main reason
to run this tool on a sequence list would be if you wished to create individual sequence elements
from each sequence in the list. This is somewhat uncommon. If your aim is to create a list
containing a subset of the sequences from another list, this can be done directly from the table
view of sequence lists (see section 14.1.3), or using Split Sequence List (see section 27.7).
Figure 18.3: Extracted sequences can be put into a new sequence list or split into individual
sequence elements.
• Dinucleotide shuffling. Shuffle method generating a sequence of the exact same dinucleotide
frequency.
• Mononucleotide sampling from zero order Markov chain. Resampling method generating
a sequence of the same expected mononucleotide frequency.
• Dinucleotide sampling from first order Markov chain. Resampling method generating a
sequence of the same expected dinucleotide frequency.
• Single amino acid shuffling. Shuffle method generating a sequence of the exact same
amino acid frequency.
• Single amino acid sampling from zero order Markov chain. Resampling method generating
a sequence of the same expected single amino acid frequency.
• Dipeptide shuffling. Shuffle method generating a sequence of the exact same dipeptide
frequency.
• Dipeptide sampling from first order Markov chain. Resampling method generating a
sequence of the same expected dipeptide frequency.
For further details of these algorithms, see [Clote et al., 2005]. In addition to the shuffle method,
you can specify the number of randomized sequences to output.
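The difference between a shuffle method (exact same frequency) and a resampling method (same expected frequency) can be seen in the following Python sketch for single residues. It is a minimal illustration with hypothetical function names and is not the Workbench's implementation; dinucleotide and dipeptide shuffling require a more involved algorithm (see the reference above) and are not shown.

```python
import random
from collections import Counter

def shuffle_residues(seq, rng):
    """Permute the residues: the result has exactly the same residue counts."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def sample_zero_order(seq, rng):
    """Draw each position independently from the residue frequencies of seq."""
    counts = Counter(seq)
    symbols = sorted(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=len(seq)))

rng = random.Random(42)
original = "ACGTACGTAACC"  # hypothetical input sequence
shuffled = shuffle_residues(original, rng)
sampled = sample_zero_order(original, rng)
print(Counter(original) == Counter(shuffled))  # shuffling preserves exact counts
print(len(sampled) == len(original))           # sampling preserves only the length
```

The sampled sequence matches the original composition only on average, so individual outputs usually differ in their exact residue counts.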
Click Finish to start the tool.
This will open a new view in the View Area displaying the shuffled sequence. The new sequence
is not saved automatically. To save the sequence, drag it into the Navigation Area or press Ctrl
+ S (⌘ + S on Mac) to activate a save dialog.
• Distance correction (only valid for protein sequences) In order to treat evolutionary
transitions of amino acids, a distance correction measure can be used when calculating
the dot plot. These distance correction matrices (substitution matrices) take into account
the likelihood of one amino acid changing to another.
• Window size A residue by residue comparison (window size = 1) would undoubtedly result in
a very noisy background due to a lot of similarities between the two sequences of interest.
For DNA sequences the background noise will be even more dominant as a match between
only four nucleotides is very likely to happen. Moreover, a residue by residue comparison
(window size = 1) can be very time consuming and computationally demanding. Increasing
the window size will make the dot plot more 'smooth'.
Note! Calculating dot plots takes up a considerable amount of memory in the computer.
Therefore, you will see a warning message if the sum of the number of nucleotides/amino acids
in the sequences is higher than 8000. If you insist on calculating a dot plot with more residues
the Workbench may shut down, but it will still allow you to save your work first. However, this
depends on your computer's memory configuration.
Click Finish to start the tool.
Adjusting the sliders above the gradient box is also practical, when producing an output for
printing where too much background color might not be desirable. By crossing one slider over
the other (the two sliders change side) the colors are inverted, allowing for a white background
(figure 18.7).
Figure 18.7: Dot plot with inverted colors, practical for printing.
The scores that are drawn on the plot are affected by several issues.
• Window size
A single residue comparison (window size = 1) in a dot plot will undoubtedly result in a noisy
background. Many positions match by chance when there are only four possible residues, as in
nucleotide sequences. You can therefore set a window size, which smooths the dot plot. Instead
of comparing single residues, subsequences of the length set as window size are compared. The
score is then calculated with respect to aligning the subsequences.
• Threshold
The dot plot shows the calculated scores colored according to a threshold, which makes it easier to
recognize the most important similarities.
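The effect of window size and threshold can be sketched in Python as follows. The sketch scores each window by the number of identical residues, which is an assumption made for simplicity; the Workbench may score windows differently, for example with a substitution matrix for proteins. Function names are hypothetical.

```python
def dot_plot(seq_x, seq_y, window=1):
    """scores[i][j] = number of identical residues between the window starting
    at position i of seq_y and the window starting at position j of seq_x."""
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [
        [sum(seq_y[i + k] == seq_x[j + k] for k in range(window)) for j in range(cols)]
        for i in range(rows)
    ]

def apply_threshold(scores, threshold):
    """Keep only cells whose score reaches the threshold."""
    return [[score >= threshold for score in row] for row in scores]

# Hypothetical sequence containing a direct repeat, plotted against itself.
seq = "ACGTACGT"
scores = dot_plot(seq, seq, window=3)
dots = apply_threshold(scores, threshold=3)
for row in dots:
    print("".join("*" if hit else "." for hit in row))
```

With a window of 3 and a threshold of 3, only exact three-residue matches are drawn: the main diagonal, plus a parallel line caused by the repeat.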
Similar sequences The simplest example of a dot plot is obtained by plotting two homologous
sequences of interest. If very similar or identical sequences are plotted against each other a
diagonal line will occur.
The dot plot in figure 18.8 shows two related sequences of the Influenza A virus nucleoproteins
infecting ducks and chickens. Accession numbers for the two sequences are: DQ232610
and DQ023146. Both sequences can be retrieved directly from
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi.
Figure 18.8: Dot plot of DQ232610 vs. DQ023146 (Influenza A virus nucleoproteins) showing an
overall similarity.
Repeated regions Sequence repeats can also be identified using dot plots. A repeat region will
typically show up as lines parallel to the diagonal line.
Figure 18.9: Direct and inverted repeats shown on an amino acid sequence generated for
demonstration purposes.
If the dot plot shows more than one diagonal in the same region of a sequence, the regions
corresponding to the other sequence are repeated. In figure 18.10 you can see a sequence with
repeats.
Figure 18.10: The dot plot of a sequence showing repeated elements. See also figure 18.9.
Frame shifts Frame shifts in a nucleotide sequence can occur due to insertions, deletions or
mutations. Such frame shifts can be visualized in a dot plot as seen in figure 18.11. In this
figure, three frame shifts for the sequence on the y-axis are found.
1. Deletion of nucleotides
2. Insertion of nucleotides
Figure 18.11: This dot plot shows various frame shifts in the sequence. See text for details.
Sequence inversions In dot plots you can see an inversion of a sequence as a diagonal running
perpendicular to the diagonal showing similarity. In figure 18.12 you can see a dot plot (window
length is 3) with an inversion.
Figure 18.12: The dot plot showing an inversion in a sequence. See also figure 18.9.
Figure 18.13: The dot plot showing a low-complexity region in the sequence. The sequence is
artificial and low complexity regions do not always show as a square.
Table 18.1: The BLOSUM62 matrix. A tabular view of the BLOSUM62 matrix containing all
possible substitution scores [Henikoff and Henikoff, 1992].
Based on evolution of proteins it became apparent that these changes or substitutions of amino
acids can be modeled by a scoring matrix, also referred to as a substitution matrix. See an
example of a scoring matrix in table 18.1. This matrix lists the substitution scores of every
single amino acid. A score for an aligned amino acid pair is found at the intersection of the
corresponding column and row. For example, the substitution score from an arginine (R) to
a lysine (K) is 2. The diagonal shows scores for amino acids which have not changed. Most
substitution changes have a negative score. Only rounded numbers are found in this matrix.
The two most used matrices are the BLOSUM [Henikoff and Henikoff, 1992] and PAM [Dayhoff
and Schwartz, 1978].
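As a minimal sketch of how scores are read from such a matrix, the Python snippet below sums the pair scores over an ungapped alignment. The dictionary holds only a small excerpt of BLOSUM62 (the entries for A, R and K). The names BLOSUM62_EXCERPT, pair_score and alignment_score are made up for this illustration and are not part of the Workbench.

```python
# Excerpt of the BLOSUM62 matrix (A, R and K only); see table 18.1 for the full matrix.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("R", "R"): 5, ("K", "K"): 5,
    ("A", "R"): -1, ("A", "K"): -1, ("R", "K"): 2,
}


def pair_score(a, b):
    """Return the score found at the intersection of row a and column b."""
    # The matrix is symmetric, so the pair is looked up in either order.
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the pair scores of two equally long, ungapped aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(pair_score("R", "K"))
print(alignment_score("ARK", "AKK"))
```

Real alignment programs additionally handle gaps and gap penalties, which this sketch leaves out.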
• PAM
The first PAM matrix (Point Accepted Mutation) was published in 1978 by Dayhoff et al. The
PAM matrix was built through a global alignment of related sequences all having sequence
similarity above 85% [Dayhoff and Schwartz, 1978]. A PAM matrix shows the probability
that any given amino acid will mutate into another in a given time interval. As an example,
PAM1 indicates that one amino acid out of 100 will mutate in a given time interval. At the
other end of the scale, a PAM256 matrix gives the probability of 256 mutations per 100
amino acids (see figure 18.14).
There are some limitations to the PAM matrices which make the BLOSUM matrices
somewhat more attractive. The dataset on which the initial PAM matrices were built is very
old by now, and the PAM matrices assume that all amino acids mutate at the same rate,
which is not a correct assumption.
• BLOSUM
In 1992, 14 years after the PAM matrices were published, the BLOSUM matrices (BLOcks
SUbstitution Matrix) were developed and published [Henikoff and Henikoff, 1992].
Henikoff et al. wanted to model more divergent proteins, thus they used locally aligned
sequences where none of the aligned sequences share less than 62% identity. This
resulted in a scoring matrix called BLOSUM62. In contrast to the PAM matrices the
BLOSUM matrices are calculated from alignments without gaps emerging from the BLOCKS
database.
Sean Eddy recently wrote a paper reviewing the BLOSUM62 substitution matrix and how to
calculate the scores [Eddy, 2004].
Use of scoring matrices Deciding which scoring matrix to use in order to obtain the
best alignment results is a difficult task. If you have no prior knowledge of the sequence,
BLOSUM62 is probably the best choice. This matrix has become the de facto standard for scoring
matrices and is also used as the default matrix in BLAST searches. Selecting a "wrong"
scoring matrix will most probably have a strong influence on the outcome of the analysis. In general a
few rules apply to the selection of scoring matrices.
• For closely related sequences choose BLOSUM matrices created for highly similar
alignments, like BLOSUM80. You can also select low PAM matrices such as PAM1.
• For distantly related sequences, select low BLOSUM matrices (for example BLOSUM45) or
high PAM matrices such as PAM250.
The BLOSUM matrices with low numbers correspond to PAM matrices with high numbers. See
figure 18.14 for correlations between the PAM and BLOSUM matrices. To summarize, if you
want to find distantly related proteins to a sequence of interest using BLAST, you could benefit from
using BLOSUM45 or similar matrices.
Figure 18.14: Relationship between scoring matrices. The BLOSUM62 has become a de facto
standard scoring matrix for a wide range of alignment programs. It is the default matrix in BLAST.
Click Finish to start the tool. The values of the complexity plot approach 1.0 as the distribution
of amino acids becomes more complex.
See section A in the appendix for information about the graph view.
• Individual statistics layout. If several sequences were selected in Step 1, this function
generates a separate statistics report for each sequence.
• Comparative statistics layout. If several sequences were selected in Step 1, this function
generates statistics with comparisons between the sequences.
For protein sequences, you can choose to include Background distribution of amino acids. If this
box is ticked, an extra column with the amino acid distribution of the chosen species is included in the
table output. (The distributions are calculated from UniProt https://uniprot.org/ version
6.0, dated September 13 2005.)
You can also choose between two different sets of values for calculation of extinction coefficients:
• [Gill and von Hippel, 1989]: Ext(Cystine) = 120, Ext(Tyr) = 1280 and Ext(Trp) = 5690
• [Pace et al., 1995]: Ext(Cystine) = 125, Ext(Tyr) = 1490 and Ext(Trp) = 5500
• Sequence Information:
Sequence type
Length
Organism
Name
Description
Modification Date
Weight. This is calculated as
$$\text{weight} = \sum_{\text{units in sequence}} \text{weight}(\text{unit}) - \text{links} \cdot \text{weight}(\mathrm{H_2O})$$
where links is the sequence length minus one and units are
amino acids. The atomic composition is defined the same way.
Isoelectric point
Aliphatic index
• Annotation counts
• General statistics:
Sequence type
Length
Organism
Name
Description
Modification Date
Weight (calculated as single-stranded and double-stranded DNA)
• Annotation table
If nucleotide sequences are used as input, and these are annotated with CDS, a section on
codon statistics for coding regions is included. This represents statistics for all codons; however,
only codons that contribute with amino acids to the translated sequence will be counted.
A short description of the different areas of the statistical output is given in section 18.6.1.
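The weight calculation listed under Sequence Information can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name sequence_weight is hypothetical, and the mass table covers just three amino acids with approximate average masses of the free amino acids, not the values used by the Workbench.

```python
# Approximate average masses (Da) of free amino acids; illustrative subset only.
UNIT_WEIGHT = {"G": 75.07, "A": 89.09, "S": 105.09}
WATER_WEIGHT = 18.015  # one H2O is released for every link (peptide bond)


def sequence_weight(sequence):
    """Sum of unit weights minus links * weight(H2O), with links = length - 1."""
    if not sequence:
        return 0.0
    links = len(sequence) - 1
    return sum(UNIT_WEIGHT[unit] for unit in sequence) - links * WATER_WEIGHT


print(round(sequence_weight("GAS"), 2))
```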
• Molecular weight The molecular weight is the mass of a protein or molecule. The molecular
weight is simply calculated as the sum of the atomic mass of all the atoms in the molecule.
The weight of a protein is usually represented in Daltons (Da).
A calculation of the molecular weight of a protein does not usually include
post-translational modifications. For native and unknown proteins it tends to be difficult to
assess whether post-translational modifications such as glycosylations are present on the
protein, making a calculation based solely on the amino acid sequence inaccurate. The
molecular weight can be determined very accurately by mass spectrometry in a laboratory.
• Isoelectric point The isoelectric point (pI) of a protein is the pH at which the protein has no
net charge. The pI is calculated from the pKa values for the 20 different amino acids. At a
pH below the pI, the protein carries a positive charge, whereas at a pH above the pI the
protein carries a negative charge. In other words, pI is high for basic proteins and low for
acidic proteins. This information can be used in the laboratory when running electrophoretic
gels. Here the proteins can be separated based on their isoelectric point.
• Aliphatic index The aliphatic index of a protein is a measure of the relative volume occupied
by the aliphatic side chains of the following amino acids: alanine, valine, leucine and isoleucine.
An increase in the aliphatic index increases the thermostability of globular proteins. The
index is calculated by the following formula:
$$\text{Aliphatic index} = X(\text{Ala}) + a \cdot X(\text{Val}) + b \cdot X(\text{Leu}) + b \cdot X(\text{Ile})$$
X(Ala), X(Val), X(Ile) and X(Leu) are the amino acid compositional fractions. The constants a
and b are the relative volumes of the valine (a=2.9) and leucine/isoleucine (b=3.9) side chains
compared to the side chain of alanine [Ikai, 1980].
• Estimated half-life The half-life of a protein is the time it takes for the pool of that
particular protein to be reduced by half. The half-life of a protein, and thus its overall
stability, is highly dependent on the identity of the N-terminal amino acid [Bachmair et al.,
1986, Gonda et al., 1989, Tobias et al., 1991]. The importance of the N-terminal residue
is generally known as the 'N-end rule'. According to the N-end rule, the N-terminal
amino acid determines the half-life of a protein. The estimated half-life of proteins
has been investigated in mammals, yeast and E. coli (see Table 18.2). If leucine is found
N-terminally in mammalian proteins the estimated half-life is 5.5 hours.
• Extinction coefficient This measure indicates how much light is absorbed by a protein at
a particular wavelength. The extinction coefficient is measured by UV spectrophotometry,
but can also be calculated. The amino acid composition is important when calculating
the extinction coefficient. The extinction coefficient is calculated from the absorbance of
cysteine, tyrosine and tryptophan.
Two values are reported. The first value, "Non-reduced cysteines", is computed assuming
that all cysteine residues appear as half cystines, meaning they form di-sulfide bridges to
other cysteines:
Table 18.2: Estimated half life. Half life of proteins where the N-terminal residue is listed in the
first column and the half-life in the subsequent columns for mammals, yeast and E. coli.
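$$\text{Ext}(\text{Protein}) = \frac{\text{count}(\text{Cys})}{2} \cdot \text{Ext}(\text{Cys}) + \text{count}(\text{Tyr}) \cdot \text{Ext}(\text{Tyr}) + \text{count}(\text{Trp}) \cdot \text{Ext}(\text{Trp})$$
The second value, "Reduced cysteines", assumes that no di-sulfide bonds are formed:
$$\text{Ext}(\text{Protein}) = \text{count}(\text{Tyr}) \cdot \text{Ext}(\text{Tyr}) + \text{count}(\text{Trp}) \cdot \text{Ext}(\text{Trp})$$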
The extinction coefficient values of the three important amino acids at different wavelengths
are found in [Gill and von Hippel, 1989] or in [Pace et al., 1995]. At 280nm the extinction
coefficients are
[Gill and von Hippel, 1989]: Ext(Cystine) = 120, Ext(Tyr) = 1280 and Ext(Trp) = 5690
[Pace et al., 1995]: Ext(Cystine) = 125, Ext(Tyr) = 1490 and Ext(Trp) = 5500
pH 6.5
6.0 M guanidium hydrochloride
0.02 M phosphate buffer
Knowing the extinction coefficient, the absorbance (optical density) can be calculated using
the following formula:
$$\text{Absorbance}(\text{Protein}) = \frac{\text{Ext}(\text{Protein})}{\text{Molecular weight}}$$
• Atomic composition Amino acids are very simple compounds. All 20 amino acids
consist of combinations of only five different elements. The atoms which can be found in these
simple structures are: Carbon, Nitrogen, Hydrogen, Sulfur, Oxygen. The atomic composition
of a protein can for example be used to calculate the precise molecular weight of the entire
protein.
• Total number of negatively charged residues (Asp + Glu) At neutral pH, the fraction
of negatively charged residues provides information about the location of the protein.
Intracellular proteins tend to have a higher fraction of negatively charged residues than
extracellular proteins.
• Total number of positively charged residues (Arg + Lys) At neutral pH, nuclear proteins
have a high relative percentage of positively charged amino acids. Nuclear proteins often
bind to the negatively charged DNA, which may regulate gene expression or help to fold the
DNA. Nuclear proteins often have a low percentage of aromatic residues [Andrade et al.,
1998].
• Amino acid distribution Amino acids are the basic components of proteins. The amino acid
distribution in a protein is simply the percentage of the different amino acids represented
in a particular protein of interest. Amino acid composition is generally conserved within
protein families across different organisms, which can be useful when studying a particular protein
or enzyme across species. Another interesting observation is that amino acid
composition varies slightly between proteins from different subcellular localizations. This
fact has been used in several computational methods for the prediction of subcellular
localization.
• Annotation table This table provides an overview of all the different annotations associated
with the sequence and their incidence.
• Dipeptide distribution This measure is simply a count, or frequency, of all the observed
adjacent pairs of amino acids (dipeptides) found in the protein. It is only possible to report
neighboring amino acids. Knowledge of dipeptide composition has previously been used
for prediction of subcellular localization.
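Several of the measures described in this list reduce to short calculations. The following Python sketch illustrates the aliphatic index, the two extinction coefficient values, the absorbance and the dipeptide count. The function names are hypothetical, the extinction values are those listed for [Pace et al., 1995], and the code follows the formulas as written in this section rather than the Workbench's actual implementation.

```python
from collections import Counter

A_VAL = 2.9       # relative volume of the valine side chain [Ikai, 1980]
B_LEU_ILE = 3.9   # relative volume of the leucine/isoleucine side chains

# Extinction values at 280 nm as listed for [Pace et al., 1995].
EXT_PACE = {"cystine": 125, "Y": 1490, "W": 5500}


def aliphatic_index(protein):
    """X(Ala) + a*X(Val) + b*X(Leu) + b*X(Ile), with X as compositional fractions."""
    n = len(protein)
    x = {aa: protein.count(aa) / n for aa in "AVLI"}
    return x["A"] + A_VAL * x["V"] + B_LEU_ILE * (x["L"] + x["I"])


def extinction_coefficient(protein, ext=EXT_PACE, reduced=False):
    """Return the 'Reduced cysteines' or 'Non-reduced cysteines' value."""
    counts = Counter(protein)
    value = counts["Y"] * ext["Y"] + counts["W"] * ext["W"]
    if not reduced:
        # Non-reduced: every pair of cysteines is assumed to form one cystine.
        value += counts["C"] / 2 * ext["cystine"]
    return value


def absorbance(ext_protein, molecular_weight):
    """Absorbance(Protein) = Ext(Protein) / Molecular weight."""
    return ext_protein / molecular_weight


def dipeptide_counts(protein):
    """Count all adjacent (overlapping) pairs of amino acids."""
    return Counter(protein[i:i + 2] for i in range(len(protein) - 1))
```

If the index is wanted on the scale of mole percentages rather than fractions, the result of aliphatic_index can be multiplied by 100.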
In step 2 you can change the order in which the sequences will be joined. Select a sequence and
use the arrows to move the selected sequence up or down.
Click Finish to start the tool.
The result is shown in figure 18.20.
Figure 18.20: The result of joining sequences is a new sequence containing the annotations of the
joined sequences (they each had a HBB annotation).
Figure 18.21: Setting parameters for the pattern discovery. See text for details.
Select to use an already existing model which is seen in figure 18.21. Models are represented
with the following icon in the Navigation Area ( ).
• Create and search with new model. This will create a new HMM model based on the
selected sequences. The found model will be opened after the run and presented in a table
view. It can be saved and used later if desired.
• Use existing model. It is possible to use already created models to search for the same
pattern in new sequences.
• Minimum pattern length. The minimum length of the patterns to search for.
• Maximum pattern length. The maximum length of the patterns to search for.
• Noise (%). Specify the noise level of the model. This parameter influences the level
of degeneracy of patterns in the sequence(s). The noise parameter can be 1, 2, 5 or 10
percent.
• Number of different kinds of patterns to predict. The number of iterations the algorithm goes
through. After the first iteration, the pattern positions predicted in the first run are forced to be
members of the background. In that way, the algorithm finds new patterns in the second
iteration. Patterns marked 'Pattern1' have the highest confidence. The maximum number of
iterations is 3.
Click Finish to start the tool. This will open a view showing the patterns found as annotations on
the original sequence (see figure 18.22). If you have selected several sequences, a corresponding
number of views will be opened.
• Search in an open sequence Common motifs and custom motifs can be quickly scanned
for and visualized on a sequence while working with that sequence interactively. See
section 18.9.1 for details.
• Use the Motif Search tool A more refined and systematic search for motifs can be
performed using the Motif Search tool. This generates a table and can optionally add
annotations to the sequences. See section 18.9.2 for details.
Figure 18.23: The Motifs palette of the Side Panel of an open sequence. A single instance of the
CMV motif has been detected.
Figure 18.24: When the box next to a motif type is checked, any instances of that motif in a
sequence will be highlighted in the view.
Figure 18.25: Hover the mouse cursor over a motif region on the sequence to reveal a tool tip with
information about the motif.
Below the labels option there are two options for controlling the way the sequence should be
searched for motifs:
• Include reverse motifs. This will also find motifs on the negative strand (only available for
nucleotide sequences)
• Exclude matches in N-regions for simple motifs. The motif search handles ambiguous
characters such that two residues are considered different only if they have no residues in
common. For example: For nucleotides, N matches any character and R matches A,G. For
proteins, X matches any character and Z matches E,Q. Genome sequences often have large
regions with unknown sequence. These regions are very often padded with N's. When this
checkbox is ticked, hits found in N-regions are not displayed, and if a residue in a motif matches
an N, it is treated as a mismatch.
The list of motifs shown in figure 18.23 is a pre-defined list that is included with the workbench,
but you can define your own set of motifs to use instead. To do this, either
launch the Create Motif List tool from the Navigation Area or use the Add Motif button in the
Side Panel (see section 18.10). Once your list of custom motif(s) is saved, you can click the
Manage Motifs button in the Side Panel, which will bring up the dialog shown in figure 18.26.
At the top, select a motif list by clicking the Browse ( ) button. When the motif list is selected,
its motifs are listed in the panel in the left-hand side of the dialog. The right-hand side panel
contains the motifs that will be listed in the Side Panel when you click Finish.
See section 18.9.2 for a non-interactive option for detecting motifs.
Simple motif. Choosing this option means that you enter a simple motif, e.g.
ATGATGNNATG.
Java regular expression. See section 18.9.3.
Prosite regular expression. For proteins, you can enter different protein patterns from
the PROSITE database (protein patterns using regular expressions and describing
specific amino acid sequences). The PROSITE database contains a great number of
patterns and has been used to identify related proteins (see
https://prosite.expasy.org/cgi-bin/prosite/prosite-list.pl).
Figure 18.27: Specifying the options for the Motif Search tool.
Use motif list. Clicking the small button ( ) will allow you to select a saved motif list
(see section 18.10).
• Motif. If you choose to search with a simple motif, you should enter a literal string as your
motif. Ambiguous amino acids and nucleotides are allowed. Example: ATGATGNNATG. If
your motif type is Java regular expression, you should enter a regular expression according
to the syntax rules described in section 18.9.3. Press Shift + F1 for options. For
proteins, you can search with a Prosite regular expression, in which case you should enter a protein
pattern from the PROSITE database.
• Accuracy. If you search with a simple motif, you can adjust the accuracy of the motif to the
match on the sequence. If you type in a simple motif and let the accuracy be 80%, the motif
search algorithm runs through the input sequence and finds all subsequences of the same
length as the simple motif such that the fraction of identity between the subsequence and
the simple motif is at least 80%. A motif match is added to the sequence as an annotation
with the exact fraction of identity between the subsequence and the simple motif. If you
use a list of motifs, the accuracy applies only to the simple motifs in the list.
• Search for reverse motif. This enables searching on the negative strand on nucleotide
sequences.
• Exclude unknown regions. Genome sequences often have large regions with unknown
sequence. These regions are very often padded with N's. When this checkbox is ticked, hits
found in N-regions are not displayed. The motif search handles ambiguous characters such that
two residues are considered different only if they have no residues in common. For example: For
nucleotides, N matches any character and R matches A,G. For proteins, X matches any
character and Z matches E,Q.
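The behavior of the Accuracy setting and the handling of ambiguous characters can be sketched as follows. This Python example is an illustration under stated assumptions, not the Workbench's implementation: the function names are hypothetical, only a subset of the nucleotide ambiguity codes is included, and the exclude_n flag models only the "N is treated as a mismatch" part of excluding unknown regions.

```python
# Subset of the nucleotide ambiguity codes and the residues they stand for.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}


def residues_match(a, b):
    """Two residues match if they have at least one residue in common."""
    return bool(set(AMBIGUITY.get(a, a)) & set(AMBIGUITY.get(b, b)))


def simple_motif_search(sequence, motif, accuracy=1.0, exclude_n=False):
    """Return (start, identity) for each window whose identity >= accuracy."""
    if not motif:
        raise ValueError("motif must not be empty")
    hits = []
    length = len(motif)
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        matches = 0
        for seq_char, motif_char in zip(window, motif):
            if exclude_n and seq_char == "N":
                continue  # an N in the sequence counts as a mismatch
            if residues_match(seq_char, motif_char):
                matches += 1
        identity = matches / length
        if identity >= accuracy:
            hits.append((start, identity))
    return hits


print(simple_motif_search("CCATGATGCCATGTT", "ATGATGNNATG"))
```

With accuracy set to 0.8, any window in which at least 80% of the positions match the motif would be reported together with its exact fraction of identity.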
Click Next to adjust how to handle the results and then click Finish. There are multiple types of
results that can be produced:
• Create report. This will create a report with summary information about motifs found.
• Create table. This will create an overview table of all the motifs found for all the input
sequences.
• Add annotations to sequences. This will add an annotation to the sequence when a motif
is found (an example is shown in figure 18.28). For details on viewing annotations see
section 14.3.1.
Figure 18.28: Sequence view displaying the pattern found. The search string was 'tataaa'.
[A-Z] will match the characters A through Z (Range). You can also put single characters
between the brackets: The expression [AGT] matches the characters A, G or T.
[A-D[M-P]] will match the characters A through D and M through P (Union). You can also put
single characters between the brackets: The expression [AG[M-P]] matches the characters
A, G and M through P.
[A-M&&[H-P]] will match the characters between A and M lying between H and P (Intersection).
You can also put single characters between the brackets. The expression [A-M&&[HGTDA]]
matches the characters A through M that are H, G, T, D or A.
[^A-M] will match any character except those between A and M (Excluding). You can also
put single characters between the brackets: The expression [^AG] matches any character
except A and G.
[A-Z&&[^M-P]] will match any character A through Z except those between M and P
(Subtraction). You can also put single characters between the brackets: The expression
[A-P&&[^CG]] matches any character between A and P except C and G.
X{n} will match a repetition of an element indicated by following that element with a
numerical value or a numerical range between the curly brackets. For example, ACG{2}
matches the string ACGG and (ACG){2} matches ACGACG.
X{n,m} will match a certain number of repetitions of an element indicated by following that
element with two numerical values between the curly brackets. The first number is a lower limit on the
number of repetitions and the second number is an upper limit on the number of repetitions.
For example, ACT{1,3} matches ACT, ACTT and ACTTT.
X{n,} represents a repetition of an element at least n times. For example, (AC){2,} matches
all strings ACAC, ACACAC, ACACACAC,...
The symbol ^ restricts the search to the beginning of your sequence. For example, if you
search through a sequence with the regular expression ^AC, the algorithm will find a match
if AC occurs at the beginning of the sequence.
The symbol $ restricts the search to the end of your sequence. For example, if you search
through a sequence with the regular expression GT$, the algorithm will find a match if GT
occurs at the end of the sequence.
Examples
The expression [ACG][^AC]G{2} matches all strings of length 4, where the first character is A, C
or G, the second is any character except A and C, and the third and fourth characters are G. The
expression G.[^A]$ matches all strings of length 3 at the end of your sequence, where the first
character is G, the second any character and the third any character except A.
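The quantifier and anchor rules above can be tried out quickly in code. The sketch below uses Python's re module, which shares this part of the syntax with Java regular expressions. Note that Python's re does not support the nested character-class union and intersection forms (such as [A-D[M-P]] or &&), so those are not shown. The helper first_match is a hypothetical name used only here.

```python
import re


def first_match(pattern, sequence):
    """Return the first substring of sequence that matches pattern, or None."""
    found = re.search(pattern, sequence)
    return found.group(0) if found else None


# {2} applies to the preceding element only, unless a group is used.
print(first_match(r"ACG{2}", "TTACGGTT"))
print(first_match(r"(ACG){2}", "TTACGACGTT"))

# ^ and $ anchor the match to the beginning and the end of the sequence.
print(first_match(r"^AC", "ACGT"))
print(first_match(r"GT$", "ACGT"))

# The first expression from the Examples paragraph.
print(first_match(r"[ACG][^AC]G{2}", "TTATGGTT"))
```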
• Name. The name of the motif. In the result of a motif search, this name will appear as the
name of the annotation and in the result table.
• Motif. The actual motif. See section 18.9.2 for more information about the syntax of
motifs.
• Description. You can enter a description of the motif. In the result of a motif search, the
description will appear in the result table and will be added as a note to the annotation on
the sequence (visible in the Annotation table ( ) or by placing the mouse cursor on the
annotation).
• Type. You can enter three different types of motifs: simple motifs, Java regular expressions
or PROSITE regular expressions. Read more in section 18.9.2.
The motif list can contain a mix of different types of motifs. This is practical because some
motifs can be described with the simple syntax, whereas others need the more advanced regular
expression syntax.
Instead of manually adding motifs, you can Import From Fasta File ( ). This will show a dialog
where you can select a fasta file on your computer and use this to create motifs. This will
automatically take the name, description and sequence information from the fasta file, and put it
into the motif list. The motif type will be "simple". Note that reformatting a PROSITE file into FASTA
format for import will fail, as only simple motifs can be imported this way and regular expressions
are not supported.
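To make the mapping from a fasta file to motif entries concrete, here is a minimal Python sketch. It assumes that the first word of each header line is the name and the rest of the line is the description. That split, the function name motifs_from_fasta and the dictionary layout are assumptions made for illustration and may differ from how the Workbench parses the file.

```python
def motifs_from_fasta(text):
    """Turn FASTA-formatted text into a list of simple motif entries."""
    motifs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumed convention: ">name description..."
            header = line[1:].split(None, 1)
            motifs.append({
                "name": header[0] if header else "",
                "description": header[1] if len(header) > 1 else "",
                "motif": "",
                "type": "simple",  # imported motifs are always simple motifs
            })
        elif motifs:
            # Sequence lines are concatenated onto the most recent entry.
            motifs[-1]["motif"] += line
    return motifs


example = ">TATA TATA box\nTATA\nAA\n>GC\nGGGCGG\n"
for entry in motifs_from_fasta(example):
    print(entry)
```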
Besides adding new motifs, you can also edit and delete existing motifs in the list. To edit a
motif, either double-click the motif in the list, or select and click the Edit ( ) button at the
bottom of the view.
To delete a motif, select it and press the Delete key on the keyboard. Alternatively, click Delete
( ) in the Tool bar.
Save the motif list in the Navigation Area, and you will be able to use it for Motif Search ( ) (see
section 18.9).
Chapter 19
Nucleotide analyses
Contents
19.1 Convert DNA to RNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
19.2 Convert RNA to DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
19.3 Reverse complements of sequences . . . . . . . . . . . . . . . . . . . . . . . 407
19.4 Translation of DNA or RNA to protein . . . . . . . . . . . . . . . . . . . . . . 408
19.5 Find open reading frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
CLC Main Workbench offers different kinds of sequence analyses that apply only to DNA and
RNA.
Use the arrows to add or remove sequences or sequence lists from the selected elements list.
You can select multiple DNA sequences and sequence lists for conversion. If a sequence list
contains RNA sequences, those sequences will not be converted.
Click Finish to start the tool.
Use the arrows to add or remove sequences or sequence lists from the selected elements list.
You can select multiple RNA sequences and sequence lists for conversion. If a selected sequence
list contains DNA sequences, those sequences will not be converted.
Click Finish to start the tool.
This will open a new view in the View Area displaying the new DNA sequence. The new sequence
is not saved automatically. To save the sequence, drag it into the Navigation Area or press Ctrl
+ S (⌘ + S on Mac) to activate a save dialog.
This will open a new view in the View Area displaying the reverse complement of the selected
sequence. The new sequence is not saved automatically. To save the sequence, drag it into the
Navigation Area or press Ctrl + S (⌘ + S on Mac) to activate a save dialog.
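The three conversions described so far in this chapter are simple string operations. The following Python sketch shows the idea for unambiguous nucleotide sequences. The function names are hypothetical and ambiguity codes are not handled.

```python
# Translation table mapping each base to its complement (both letter cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def dna_to_rna(dna):
    """Replace every T with U."""
    return dna.replace("T", "U").replace("t", "u")


def rna_to_dna(rna):
    """Replace every U with T."""
    return rna.replace("U", "T").replace("u", "t")


def reverse_complement(dna):
    """Complement each base, then reverse the sequence."""
    return dna.translate(_COMPLEMENT)[::-1]


print(dna_to_rna("ATGT"))
print(rna_to_dna("AUGU"))
print(reverse_complement("ATGC"))
```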
Use the arrows to add or remove sequences or sequence lists from the selected elements list.
Clicking Next generates the dialog seen in figure 19.5:
Here you have the following options:
Reading frames If you wish to translate the whole sequence, you must specify the reading frame
for the translation. If you select e.g. two reading frames, two protein sequences are
generated.
Translate CDS You can choose to translate regions marked by a CDS or ORF annotation. This
will generate a protein sequence for each CDS or ORF annotation on the sequence. The
"Extract existing translations from annotation" option lists the amino acid sequence stored in the
CDS annotation and shown in its tooltip (e.g. for a sequence downloaded from NCBI). It therefore
does not represent a translation of the actual nucleotide sequence.
Genetic code Specify the genetic code to use. Hover the mouse cursor over an item in this list to
reveal a tooltip containing the relevant translation table (figure 19.5). The translation tables
are sourced from the NCBI (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).
Figure 19.5: Configure the translation options. Hover the mouse cursor over a genetic code option
to reveal a tooltip containing the relevant translation table.
Stop codons result in an asterisk being inserted in the protein sequence at the corresponding
position.
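As an illustration of translating in a chosen reading frame, the Python sketch below builds the standard genetic code (NCBI translation table 1) from its compact string form and inserts an asterisk for each stop codon. The function name translate and the use of X for codons containing unknown characters are assumptions for this example. Only the three forward frames are covered.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=1):
    """Translate a DNA or RNA sequence in forward reading frame 1, 2 or 3."""
    seq = sequence.upper().replace("U", "T")
    start = frame - 1
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    # Stop codons map to '*'; codons with unknown characters map to 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ATGGCCTAA"))
print(translate("AATGGCCTAA", frame=2))
```

Selecting two reading frames in the dialog corresponds to calling such a function once per frame, giving two protein sequences.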
Click Finish to start the tool. The newly created protein is shown, but is not saved automatically.
To save a protein sequence, drag it into the Navigation Area or press Ctrl + S (⌘ + S on Mac) to
activate a save dialog.
The name for a coding region translation consists of the name of the input sequence followed by
the annotation type and finally the annotation name.
Translate part of a nucleotide sequence If you want to make separate translations of all the
coding regions of a nucleotide sequence, you can check the option: "Translate CDS/ORF..." in
the translation dialog (see figure 19.5).
If you want to translate a specific coding region, which is annotated on the sequence, use the
following procedure:
Open the nucleotide sequence | right-click the ORF or CDS annotation | Translate
CDS/ORF... ( )
A dialog opens to offer you the following choices (figure 19.6):
• Select a genetic code translation table Translates the ORF/CDS to protein using the
selected translation table. Hover the mouse cursor over an item in this list to reveal
a tooltip containing the relevant translation table (figure 19.5). The translation tables
are sourced from the NCBI (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).
• Extract existing translation from annotation Translates the ORF/CDS to protein using
existing translation information available in the annotation.
• Start codon
AUG Most commonly used start codon. When selected, only AUG (or ATG) codons are
used as start codons.
Figure 19.8: Configure the options for finding open reading frames. Hover the mouse cursor over
a genetic code option to reveal a tooltip containing the relevant translation table.
Any Any codon can be used as the start codon. For identification of the open reading
frames, the first possible codon in the same reading frame as the stop codon is used
as the start codon.
All the start codons in genetic code Select to use the start codons that are specific
to the genetic code specified under Genetic code.
Other Identifies open reading frames that start with one of the codons provided in the
start codon list.
• Open-ended sequence Allow ORFs to extend up to the sequence start or end not considering
the sequence context. This can be relevant when only a fragment of a sequence is analyzed,
and there may be up- or downstream start and stop codons that are not included in the
sequence. When predicting the open reading frames, stop codons are always used, but a
given start codon is only used if it is the first one after the last stop codon. Start codons
that are not preceded by a stop codon are ignored, because there may be another start
codon upstream that is not included in the sequence.
• Minimum length (codons) The minimum number of codons that must be present for an
open reading frame to be reported.
• Genetic code Specify the genetic code to use. Hover the mouse cursor over an item in this
list to reveal a tooltip containing the relevant translation table (figure 19.8). The translation
tables are sourced from the NCBI
(https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).
• Stop codon included in annotation Include the stop codon in the open reading frame
annotations on the sequences.
Using open reading frames to find genes is a fairly simple approach which is likely to predict
genes which are not real. Setting a relatively high minimum length of the ORFs will reduce the
number of false positive predictions, but at the same time short genes may be missed (see
figure 19.9).
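The effect of the start codon, minimum length and stop codon options can be sketched with a simplified ORF scan. The Python example below is an assumption-laden illustration, not the Workbench's algorithm: it scans only the three forward frames, uses the stop codons of the standard genetic code, reports only ORFs closed by a stop codon (no open-ended ORFs), and counts the length in codons without the stop codon. The function name find_orfs is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(sequence, start_codons=("ATG",), min_codons=1, include_stop=False):
    """Return sorted (start, end) positions (0-based, end exclusive) of ORFs."""
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if orf_start is not None:
                    n_codons = (i - orf_start) // 3  # stop codon not counted
                    if n_codons >= min_codons:
                        end = i + 3 if include_stop else i
                        orfs.append((orf_start, end))
                orf_start = None
            elif orf_start is None and codon in start_codons:
                # Only the first start codon after the previous stop is used.
                orf_start = i
    return sorted(orfs)


print(find_orfs("CCATGAAATTTTAGCC"))
print(find_orfs("CCATGAAATTTTAGCC", min_codons=4))
```

Raising min_codons removes short ORFs from the result, which mirrors the trade-off between false positives and missed short genes described above.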
Figure 19.9: The first 12,000 positions of the E. coli sequence NC_000913 downloaded from
GenBank. The blue (dark) annotations are the genes while the yellow (brighter) annotations are the
ORFs with a length of at least 100 amino acids. On the positive strand around position 11,000,
a gene starts before the ORF. This is due to the use of the standard genetic code rather than the
bacterial code. This particular gene starts with CTG, which is a start codon in bacteria. Two short
genes are entirely missing, while a handful of open reading frames do not correspond to any of the
annotated genes.
Protein analyses
Contents
20.1 Protein charge
20.2 Antigenicity
20.3 Hydrophobicity
20.3.1 Hydrophobicity graphs along sequence
20.3.2 Bioinformatics explained: Protein hydrophobicity
20.4 Download Pfam Database
20.5 Pfam domain search
20.6 Download 3D Protein Structure Database
20.7 Find and Model Structure
20.7.1 Create structure model
20.7.2 Model structure
20.8 Secondary structure prediction
20.9 Protein report
20.10 Reverse translation from protein into DNA
20.10.1 Bioinformatics explained: Reverse translation
20.11 Proteolytic cleavage detection
20.11.1 Bioinformatics explained: Proteolytic cleavage
CLC Main Workbench offers a number of analyses of proteins as described in this chapter.
Note that the SignalP and TMHMM plugin allows you to predict signal peptides. For more
information, please read the plugin manual at
https://resources.qiagenbioinformatics.com/manuals/signalpandtmhmm/current/User_Manual.pdf.
The TMHMM plugin allows you to predict transmembrane helices. For more information, please
read the plugin manual at
http://resources.qiagenbioinformatics.com/manuals/tmhmm/current/Tmhmm_User_Manual.pdf.
CHAPTER 20. PROTEIN ANALYSES 414
knowledge can be used e.g. in relation to isoelectric focusing on the first dimension of 2D-gel
electrophoresis. The isoelectric point (pI) is found where the net charge of the protein is
zero. The calculation of the protein charge does not include knowledge about any potential
post-translational modifications the protein may have.
The pKa values reported in the literature differ slightly, so the protein charge plot may look
slightly different from the plots produced by other programs.
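The relation between pH, net charge and pI can be sketched as follows. This is a minimal illustration using the Henderson-Hasselbalch equation; the pKa values are one commonly cited set and are an assumption here, not necessarily the values used by CLC Main Workbench. The function names are hypothetical.

```python
# Sketch: net charge of a protein as a function of pH, and a pI estimate.
# All pKa values below are assumptions (one commonly cited set); other
# programs, including the Workbench, may use slightly different values.

POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}            # +1 when protonated
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}   # -1 when deprotonated
N_TERM_PKA = 9.0   # assumed pKa of the free N-terminal amino group
C_TERM_PKA = 2.0   # assumed pKa of the free C-terminal carboxyl group


def net_charge(sequence, ph):
    """Henderson-Hasselbalch estimate of the net charge at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))     # N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))    # C-terminus
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tol=1e-4):
    """Find the pH where the net charge is zero by bisection.

    The net charge decreases monotonically with pH, so bisection converges.
    """
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2
```

Evaluating `net_charge` over a range of pH values gives the kind of curve shown in a protein charge plot. As stated above, post-translational modifications are not considered.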
In order to calculate the protein charge:
Tools | Protein Analysis | Create Protein Charge Plot
This opens the dialog displayed in figure 20.1:
If a sequence was selected before running the tool, the sequence will be listed in the Selected
Elements pane of the dialog. Use the arrows to add or remove sequences or sequence lists from
the selected elements.
You can perform the analysis on several protein sequences at a time. This will result in one
output graph showing protein charge graphs for the individual proteins.
Click Finish to start the tool.
Figure 20.2 shows the electrical charges for three proteins. In the Side Panel to the right, you
can modify the layout of the graph.
See section A in the appendix for information about the graph view.
20.2 Antigenicity
CLC Main Workbench can help to identify antigenic regions in protein sequences in different ways,
using different algorithms. The algorithms provided in the Workbench merely plot an index of
antigenicity over the sequence.
Two different methods are available:
• [Welling et al., 1985] Welling et al. used information on the relative occurrence of amino
acids in antigenic regions to make a scale which is useful for prediction of antigenic regions.
This method is better than the Hopp-Woods scale of hydrophobicity which is also used to
identify antigenic regions.
• A semi-empirical method for prediction of antigenic regions has been developed [Kolaskar
and Tongaonkar, 1990]. This method also includes information on surface accessibility
and flexibility, and at the time of publication the method was able to predict antigenic
determinants with an accuracy of 75%.
Note! Similar results from the two methods cannot always be expected, as the two methods are
based on different training sets.
Displaying the antigenicity for a protein sequence in a plot is done in the following way:
Tools | Protein Analysis | Create Antigenicity Plot
This opens a dialog. The first step allows you to add or remove sequences. If you had already
selected sequences in the Navigation Area before running the tool, these will be listed in the
Selected Elements pane. Clicking Next takes you through to Step 2, which is displayed in
figure 20.3.
Figure 20.3: Step two in the Antigenicity Plot allows you to choose different antigenicity scales and
the window size.
The Window size is the width of the window in which the antigenicity is calculated. The wider the
window, the less volatile the graph. You can choose from a number of antigenicity scales. Click
Finish to start the tool. The result can be seen in figure 20.4.
See section A in the appendix for information about the graph view.
Figure 20.4: The result of the antigenicity plot calculation and the associated Side Panel.
The level of antigenicity is calculated on the basis of the different scales. The different scales
assign different values to each type of amino acid. The antigenicity score is then calculated as the
sum of the values in a 'window', which is a particular range of the sequence. The window length
can be set from 5 to 25 residues. The wider the window, the fewer fluctuations in the antigenicity
scores.
Antigenicity graphs along the sequence can be displayed using the Side Panel. The functionality
is similar to hydrophobicity (see section 20.3.1).
20.3 Hydrophobicity
CLC Main Workbench can calculate the hydrophobicity of protein sequences in different ways,
using different algorithms (see section 20.3.2). Furthermore, hydrophobicity of sequences
can be displayed as hydrophobicity plots and as graphs along sequences. In addition, CLC
Main Workbench can calculate hydrophobicity for several sequences at the same time, and for
alignments.
Displaying the hydrophobicity for a protein sequence in a plot is done in the following way:
Tools | Protein Analysis | Create Hydrophobicity Plot
This opens a dialog. The first step allows you to add or remove sequences. If you had already
selected a sequence in the Navigation Area, this will be shown in the Selected Elements pane. Clicking
Next takes you through to Step 2, which is displayed in figure 20.5.
The Window size is the width of the window in which the hydrophobicity is calculated. The wider the
window, the less volatile the graph. You can choose from a number of hydrophobicity scales, which
are further explained in section 20.3.2. Click Finish to start the tool. The result can be seen in
figure 20.6.
See section A in the appendix for information about the graph view.
Figure 20.5: Step two in the Hydrophobicity Plot allows you to choose hydrophobicity scale and the
window size.
Figure 20.6: The result of the hydrophobicity plot calculation and the associated Side Panel.
hydrophobicity scores. You can choose one, two or all three options by selecting the boxes
(figure 20.8).
Figure 20.8: The different ways of displaying the hydrophobicity scores, using the Kyte-Doolittle
scale.
Coloring the letters and their background. When choosing coloring of letters or coloring of
their background, the color red is used to indicate high scores of hydrophobicity. A 'color-slider'
allows you to amplify the scores, thereby emphasizing areas with high (or low, blue) levels of
hydrophobicity. The color settings mentioned are default settings. By clicking the color bar just
below the color slider you get the option of changing color settings.
Graphs along sequences. When selecting graphs, you choose to display the hydrophobicity
scores underneath the sequence. This can be done either by a line-plot or bar-plot, or by coloring.
The latter option offers you the same possibilities of amplifying the scores as applies for coloring
of letters. The different ways to display the scores when choosing 'graphs' are displayed in
figure 20.8. Notice that you can choose the height of the graphs underneath the sequence.
Figure 20.9: Plot of hydrophobicity along the amino acid sequence. Hydrophobic regions on
the sequence have higher numbers according to the graph below the sequence. Furthermore,
hydrophobic regions are colored on the sequence. Red indicates regions with high hydrophobicity
and blue indicates regions with low hydrophobicity.
The hydrophobicity is calculated by sliding a fixed-size window (spanning an odd number of
residues) over the protein sequence. At the central position of the window, the average
hydrophobicity of the entire window is plotted (see figure 20.9).
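The sliding-window calculation can be sketched as follows. This is a minimal illustration using the published Kyte-Doolittle values. The function name is hypothetical, and the handling of sequence ends (no value is reported where the window does not fit) is an assumption, not a description of how the Workbench treats them.

```python
# Kyte-Doolittle hydropathy values [Kyte and Doolittle, 1982].
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence, window=9, scale=KYTE_DOOLITTLE):
    """Return (central_position, window_average) pairs, positions 1-based.

    Assumption: positions closer than half a window to either end of the
    sequence get no value.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    values = [scale[residue] for residue in sequence.upper()]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        average = sum(values[start:start + window]) / window
        profile.append((start + half + 1, average))
    return profile
```

For example, `hydrophobicity_profile("IIIIIKKKKK", window=5)` starts at position 3 with an average of 4.5 (five isoleucines) and ends at position 8 with an average of -3.9 (five lysines).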
Hydrophobicity scales Several hydrophobicity scales have been published for various uses.
Many of the commonly used hydrophobicity scales are described below.
• Kyte-Doolittle scale. The Kyte-Doolittle scale is widely used for detecting hydrophobic
regions in proteins. Regions with a positive value are hydrophobic. This scale can be used
for identifying both surface-exposed regions as well as transmembrane regions, depending
on the window size used. Short window sizes of 5-7 generally work well for predicting
putative surface-exposed regions. Large window sizes of 19-21 are well suited for finding
transmembrane domains if the values calculated are above 1.6 [Kyte and Doolittle, 1982].
These values should be used as a rule of thumb and deviations from the rule may occur.
• Engelman scale. The Engelman hydrophobicity scale, also known as the GES-scale, is
another scale which can be used for prediction of protein hydrophobicity [Engelman et al.,
1986]. Like the Kyte-Doolittle scale, this scale is useful for predicting transmembrane regions
in proteins.
• Hopp-Woods scale. Hopp and Woods developed their hydrophobicity scale for identification
of potentially antigenic sites in proteins. This scale is basically a hydrophilic index where
apolar residues have been assigned negative values. Antigenic sites are likely to be
predicted when using a window size of 7 [Hopp and Woods, 1983].
• Rose scale. The hydrophobicity scale by Rose et al. is correlated to the average area of
buried amino acids in globular proteins [Rose et al., 1985]. This results in a scale which
does not show the helices of a protein, but rather the surface accessibility.
• Janin scale. This scale also provides information about the accessible and buried amino
acid residues of globular proteins [Janin, 1979].
Table 20.1: Hydrophobicity scales. This table shows seven different hydrophobicity scales which
are generally used for prediction of e.g. transmembrane regions and antigenicity.
• Welling scale. Welling et al. used information on the relative occurrence of amino acids
in antigenic regions to make a scale which is useful for prediction of antigenic regions.
This method is better than the Hopp-Woods scale of hydrophobicity which is also used to
identify antigenic regions.
• Surface Probability. Display of surface probability based on the algorithm by [Emini et al.,
1985]. This algorithm has been used to identify antigenic determinants on the surface of
proteins.
• Chain Flexibility. Display of backbone chain flexibility based on the algorithm by [Karplus
and Schulz, 1985]. It is known that chain flexibility is an indication of a putative antigenic
determinant.
Many more scales have been published throughout the last three decades. Even though more
advanced methods have been developed for prediction of membrane spanning regions, the
simple and very fast calculations are still widely used.
Other useful resources
AAindex: Amino acid index database
http://www.genome.ad.jp/dbget/aaindex.html
• Database. Choose the database to use when searching for Pfam domains.
• Significance cutoff:
  – Use profile's gathering cutoffs. Use cutoffs specifically assigned to each family by the
    curator instead of manually assigning the Significance cutoff.
  – Significance cutoff. The E-value (expectation value) describes the number of hits one
    would expect to see by chance when searching a database of a particular size.
    Essentially, a hit with a low E-value is more significant than a hit with a high E-value.
    By lowering the significance threshold the domain search will become more specific
    and less sensitive, i.e. fewer hits will be reported but the reported hits will be more
    significant on average.
• Remove overlapping matches from the same clan. Perform post-processing of the results
where overlaps between hits are resolved by keeping the hit with the smallest E-value.
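The effect of the significance cutoff and of the overlap post-processing can be sketched as follows. This is an illustrative sketch only: the `DomainHit` record, its field names and the greedy resolution order are assumptions made for the example, not the Workbench's or HMMER's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    # Hypothetical record for one domain match; field names are illustrative.
    family: str
    clan: str      # clan identifier; empty string if the family has no clan
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive
    evalue: float


def filter_hits(hits, significance_cutoff):
    """Keep hits whose E-value is at or below the cutoff."""
    return [hit for hit in hits if hit.evalue <= significance_cutoff]


def remove_clan_overlaps(hits):
    """Resolve overlaps by keeping the hit with the smallest E-value.

    Assumed strategy: visit hits by ascending E-value and drop any hit that
    overlaps an already kept hit from the same clan.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        clashes = any(
            hit.clan and hit.clan == other.clan
            and hit.start <= other.end and other.start <= hit.end
            for other in kept
        )
        if not clashes:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)
```

Lowering `significance_cutoff` shrinks the list returned by `filter_hits`, which mirrors the trade-off described above: fewer hits, but more significant ones on average.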
If annotations were added but are not initially visible on your sequences, check under the
"Annotation types" tab of the side panel settings to ensure the Region annotation type has been
checked.
Figure 20.11: Annotations (in red) that were added by the Pfam search tool.
Detailed information for each domain annotation is available in the annotation tool tip as well as
in the Annotation Table view of the sequence list.
The domain search is performed using the hmmsearch tool from the HMMER3 package version
3.4 http://hmmer.org/. Detailed information about the scores in the Region annotations added
can be found in the HMMER User Guide http://eddylab.org/software/hmmer/Userguide.pdf.
Individual domain annotations can be removed manually, if desired. See section 14.3.5.
If you are connected to a server, you will first be asked about whether you want to download the
data locally or on a server. In the next wizard step you are asked to select the download location
(see figure 20.12).
The downloaded database will be installed in the same location as local BLAST databases (e.g.
<username>/CLCdatabases) or at a server location if the tool was executed on a CLC Server.
From the wizard it is possible to select alternative locations if more than one location is available.
When new databases are released, a new version of the database can be downloaded by invoking
the tool again (the existing database will be replaced).
If needed, the Manage BLAST Databases tool can be used to inspect or delete the database
(the database is listed with the name 'ProteinStructureSequences'). You can find the tool here:
BLAST | Manage BLAST Databases
Note: Before running the tool, a protein structure sequence database must be downloaded
and installed using the 'Download 3D Protein Structure Database' tool (see section 20.6).
In the tool wizard step 1, select the amino acid sequence to use as query from the Navigation
Area.
In step 2, specify if the output table should be opened or saved.
The Find and Model Structure tool carries out the following steps to find and rank available
structures representing the query sequence:
Input: Query protein sequence
The three steps carried out by the Find and Model Structure tool are described in short below.
BLAST against protein structure sequence database A local BLAST search is carried out for
the query sequence against the protein structure sequence database (see section 20.6).
BLAST hits with E-value > 0.0001 are rejected and a maximum of 2500 BLAST hits are retrieved.
Read more about BLAST in section 26.5.
Filter away low quality hits From the list of BLAST hits, entries are rejected based on the
following rules:
• PDB structures with a resolution poorer than 4 Å (that is, a resolution value above 4 Å) are
removed since they cannot be expected to represent a trustworthy atomistic model.
• BLAST hits with an identity to the query sequence lower than 20 % are removed since they
most likely would result in inaccurate models.
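The BLAST and filtering rules above can be summarized in a short sketch. The thresholds are the ones stated in the text; the `StructureHit` record, the choice of which hits are retained when more than 2500 are found (best E-value first), and keeping entries without a resolution value (e.g. NMR structures) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MAX_EVALUE = 0.0001             # hits with a larger E-value are rejected
MAX_HITS = 2500                 # maximum number of BLAST hits retrieved
MAX_RESOLUTION_ANGSTROM = 4.0   # structures with a larger value are removed
MIN_IDENTITY_PERCENT = 20.0     # hits with lower identity are removed


@dataclass
class StructureHit:
    # Hypothetical record for one BLAST hit against the structure database.
    pdb_id: str
    evalue: float
    identity_percent: float
    resolution: Optional[float]  # in Angstrom; None if not given (e.g. NMR)


def select_candidate_structures(hits):
    """Apply the E-value, hit-count, resolution and identity rules."""
    significant = sorted(
        (hit for hit in hits if hit.evalue <= MAX_EVALUE),
        key=lambda hit: hit.evalue,
    )[:MAX_HITS]  # assumption: the best hits are the ones retained
    return [
        hit for hit in significant
        # assumption: entries without a resolution value are not removed here
        if (hit.resolution is None or hit.resolution <= MAX_RESOLUTION_ANGSTROM)
        and hit.identity_percent >= MIN_IDENTITY_PERCENT
    ]
```

The structures that pass these filters are the ones that go on to be ranked.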
Rank the available structures For the resulting list of available structures, each structure is
scored based on its homology to the query sequence, and the quality of the structure itself. The
Template quality score is used to rank the structures in the table, and the rank of each structure
is shown in the "Rank" column (see figure 20.13). Read more about the Template quality score
in section 20.7.2.
• Help
3. Open a 3D view (Molecule Project) with the molecules from the PDB file and open the
created sequence alignment. The sequence originating from the structure will be linked
to the structure in the 3D view, so that selections on the sequence will show up on the
structure (see section 15.4).
4. Create a model structure by mapping the query sequence onto the structure based on the
sequence alignment (see section 20.7.2). If multiple copies of the template protein chain
have been made to generate a biomolecule, all copies are modeled at the same time.
5. Open a 3D view (a Molecule Project) with the structure model shown in both backbone
and wireframe representation. The model is colored by temperature (see figure 20.14), to
indicate local model uncertainty (see section 20.7.2). Other molecules from the template
PDB file are shown in orange or yellow coloring. The created sequence alignment is also
opened and linked with the 3D views so that selections on the model sequence will show
up on the model structure (see section 15.4).
The template structure is also available from the Proteins category in the Project Tree, but
hidden in the initial view. The initial view settings are saved on the Molecule Project as "Initial
visualization", and can always be reapplied from the View Settings menu found in the
bottom right corner of the Molecule Project (see section 4.6).
If you have problems viewing 3D structures, please check that your system meets the
requirements for 3D Viewers. See section 1.3.
Figure 20.14: Structure Model of CDK5_HUMAN. The atoms and backbone are colored by
temperature, showing uncertain structure in red and well defined structure in blue.
For crystal structures, the temperature factor (also called the B-factor) is given in the PDB file as
a measure of the uncertainty or disorder of each atom position. The temperature factor has the
unit Å², and is typically in the range [0, 100].
The temperature color scale ranges from blue (0) over white (50) to red (100) (see section
15.3.1).
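As a rough illustration of such a color scale, the sketch below maps a temperature value to an RGB color. The three anchor colors come from the text; linear interpolation between them and clamping of values outside [0, 100] are assumptions, and the function name is hypothetical.

```python
def temperature_to_rgb(temperature):
    """Map a temperature factor to an (R, G, B) tuple with 0-255 channels.

    Anchor colors: blue at 0, white at 50, red at 100.
    Assumption: linear interpolation between anchors, values clamped to [0, 100].
    """
    t = max(0.0, min(100.0, float(temperature)))
    if t <= 50.0:
        fraction = t / 50.0                       # blue -> white
        level = round(255 * fraction)
        return (level, level, 255)
    fraction = (t - 50.0) / 50.0                  # white -> red
    level = round(255 * (1.0 - fraction))
    return (255, level, level)
```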
For structure models created in CLC Main Workbench, the temperature factor assigned to each
atom combines three sources of positional uncertainty:
• PDB Temp. The atom position uncertainty for the template structure, represented by the
temperature factor of the backbone atoms in the template structure.
• P(alignment) The probability that the alignment of a residue in the query sequence to a
particular position on the structure is correct.
• Clash? It is evaluated if atoms in the structure model seem to clash, thereby indicating a
problem with the model.
The three aspects are combined to give a temperature value between zero and 100, as illustrated
in figure 20.15 and 20.16.
When holding the mouse over an atom, the Property Viewer in the Side Panel will show various
information about the atom. For atoms in structure models, the contributions to the assigned
temperature are listed as seen in figure 20.17.
Note: For NMR structures, the temperature factor is set to zero in the PDB file, and "Color by
Temperature" will therefore suggest that the structure is better determined than is actually
the case.
P(alignment) Alignment error is one of the largest causes of model inaccuracy, particularly
when the model is built from a template sharing low sequence identity (e.g. lower than 60%).
Figure 20.15: Evaluation of temperature color for backbone atoms in structure models.
Figure 20.16: Evaluation of temperature color for side chain atoms in structure models.
Figure 20.17: Information displayed in the Side Panel Property viewer for a modeled atom.
Misaligning a single amino acid by one position will cause a ca. 3.5 Å shift of its atoms from
their true positions.
The estimate of the probability that two amino acids are correctly aligned, P(alignment), is obtained
by averaging over all the possible alignments between two sequences, similar to [Knudsen and
Miyamoto, 2003].
This allows local alignment uncertainty to be detected even in similar sequences. For example
the position of the D in this alignment:
Template GGACDAEDRSTRSTACE---GG
Target   GGACD---RSTRSTACEKLMGG
Clash? Clashes are evaluated separately for each atom in a side chain. If the atom is considered
to clash, it will be assigned a temperature of 100.
Note: Clashes within the modeled protein chain as well as with all other molecules in the
downloaded PDB file (except water) are considered.
Ranking structures
The protein sequence of the gene affected by the variant (the query sequence) is BLASTed against
the protein structure sequence database (section 20.6).
A template quality score is calculated for the available structures found for the query sequence.
The purpose of the score is to rank structures considering both their quality and their homology
to the query sequence.
The five descriptors contributing to the score are:
• E-value
• % Match identity
• % Coverage
• Resolution
• Free R-value (Rfree)
Each of the five descriptors is scaled to [0,1], based on the linear functions seen in figure 20.18.
The five scaled descriptors are combined into the template quality score, weighting them to
emphasize homology over structure qualities.
\[ \text{Template quality score} = 3 \cdot S_{\text{E-value}} + 3 \cdot S_{\text{Identity}} + 1.5 \cdot S_{\text{Coverage}} + S_{\text{Resolution}} + 0.5 \cdot S_{\text{Rfree}} \]
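The weighting can be illustrated with a small sketch. It takes descriptors that have already been scaled to [0,1]; the scaling functions themselves are defined by the graphs in figure 20.18 and are not reproduced here. The dictionary keys and the function name are hypothetical.

```python
# Weights taken from the template quality score formula above.
WEIGHTS = {
    "evalue": 3.0,
    "identity": 3.0,
    "coverage": 1.5,
    "resolution": 1.0,
    "rfree": 0.5,
}


def template_quality_score(scaled):
    """Weighted sum of the five descriptors, each already scaled to [0, 1]."""
    for name in WEIGHTS:
        if not 0.0 <= scaled[name] <= 1.0:
            raise ValueError(f"{name} must be scaled to [0, 1]")
    return sum(WEIGHTS[name] * scaled[name] for name in WEIGHTS)
```

With these weights the score ranges from 0 to 9. The three homology descriptors account for up to 7.5 of this, and the two structure quality descriptors for up to 1.5.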
E-value is a measure of the quality of the match returned from the BLAST search. You can read
more about BLAST and E-values in section 26.5.
% Match identity is the identity between the query sequence and the BLAST hit in the matched
region. It is evaluated as
\[ \%\,\text{Match identity} = 100\,\% \cdot \frac{\text{Identity in BLAST alignment}}{L_B} \]
where $L_B$ is the length of the BLAST alignment of the matched region, as indicated in figure 20.19,
and "Identity in BLAST alignment" is the number of identical positions in the matched region.
% Coverage indicates how much of the query sequence has been covered by a given BLAST hit
(see figure 20.19). It is evaluated as
\[ \%\,\text{Coverage} = 100\,\% \cdot \frac{L_B - L_G}{L_Q} \]
where $L_B$ is the length of the BLAST alignment of the matched region, $L_G$ is the total length
of gaps in the BLAST alignment and $L_Q$ is the length of the query sequence.
Figure 20.18: From the E-value, % Match identity, % Coverage, Resolution, and Free R-value, the
contributions to the "Template quality score" are determined from the linear functions shown in the
graphs.
Figure 20.19: Schematic of a query sequence matched to a BLAST hit. LQ is the length of the
query sequence, LB is the length of the BLAST alignment of the matched region, QG1-3 are gaps in
the matched region of the query sequence, HG1-2 are gaps in the matched region of the BLAST hit
sequence, LG is the total length of gaps in the BLAST alignment.
The resolution of a crystal structure is related to the size of structural features that can be
resolved from the raw experimental data.
Rfree is used to assess possible overmodeling of the experimental data.
Resolution and Rfree are only given for crystal structures. NMR structures will therefore usually
be ranked lower than crystal structures. Likewise, structures where Rfree has not been given will
tend to receive a lower rank. This often coincides with older structures.
Figure 20.20: Sequence alignment mapping query sequence (Query CDK5_HUMAN) to the structure
with sequence "Template(3QQJ - CYCLIN-DEPENDENT KINASE 2)", producing a structure with
sequence "Model(CDK5_HUMAN)". Examples are highlighted: 1. Identical amino acids, 2. Amino
acid changes, 3. Amino acids in query sequence not aligned to a position on the template structure,
and 4. Amino acids on the template structure, not aligned to query sequence.
• For identical amino acids (example 1 in figure 20.20) => Copy atom positions from the PDB
file. If the side chain is missing atoms in the PDB file, the side chain is rebuilt (section
20.7.2).
• For amino acid changes (example 2 in figure 20.20) => Copy backbone atom positions
from the PDB file. Model side chain atom positions to match the query sequence (section
20.7.2).
• For amino acids in the query sequence not aligned to a position on the template structure
(example 3 in figure 20.20) => No atoms are modeled. The model backbone will have a
gap at this position and a "Structure modeling" issue is raised (see section 15.1.4).
• For amino acids on the template structure, not aligned to the query sequence (example 4
in figure 20.20) => The residues are deleted from the structure and a "Structure modeling"
issue is raised (see section 15.1.4).
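The four rules above can be expressed as a per-column classification of the query/template alignment. The sketch below only decides which rule applies to each column; it does not build any atoms. The function name and the action labels are hypothetical.

```python
def modeling_actions(query_aligned, template_aligned, gap="-"):
    """Classify each alignment column according to the four modeling rules.

    Returns a list of (query_residue, template_residue, action) tuples.
    """
    if len(query_aligned) != len(template_aligned):
        raise ValueError("aligned sequences must have the same length")
    actions = []
    for query_res, template_res in zip(query_aligned, template_aligned):
        if query_res == gap and template_res == gap:
            continue  # column carries no information
        if template_res == gap:
            # Example 3: query residue with no template position.
            action = "backbone_gap"
        elif query_res == gap:
            # Example 4: template residue not aligned to the query.
            action = "delete_template_residue"
        elif query_res == template_res:
            # Example 1: identical amino acids.
            action = "copy_all_atoms"
        else:
            # Example 2: amino acid change.
            action = "copy_backbone_model_side_chain"
        actions.append((query_res, template_res, action))
    return actions
```

Columns classified as `backbone_gap` or `delete_template_residue` are the ones for which a "Structure modeling" issue is raised.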
according to their energy. As the simulation proceeds, the selection increasingly favors the
rotamers with the lowest energy, and the algorithm converges.
A local minimization of the modeled side chains is then carried out, to reduce unfavorable
interactions with the surroundings.
Calculating the energy of a side chain rotamer
The total energy is composed of several terms:
• Statistical potential: This score accounts for interactions between the given side chain and
the local backbone, and is estimated from a database of high-resolution crystal structures.
It depends only on the rotamer and the local backbone dihedral angles φ and ψ.
• Atom interaction potential: This score is used to evaluate the interaction between a given
side chain atom and its surroundings.
• Disulfide potential: Only applies to cysteines. It follows the form used in the RASP
program [Miao et al., 2011] and serves to allow disulfide bridges between cysteine
residues. It penalizes deviations from ideal disulfide geometry. A distance filter is applied
to determine if the disulfide potential should be used, and when it is applied the atom
interaction potential between the two sulfur atoms is turned off. Note that disulfide bridges
are not formed between separate chains.
Note: The atom interaction potential considers interactions within the modeled protein
chain as well as with all other molecules in the downloaded PDB file (except water).
• Harmonic potential: This penalizes small deviations from ideal rotamers according to a
harmonic potential. This is motivated by the concept of a rotamer representing a minimum
energy state for a residue without external interactions.
With CLC Main Workbench, the secondary structure of proteins can be predicted very quickly. Predicted
elements are alpha-helix, beta-sheet (same as beta-strand) and other regions.
A hidden Markov model (HMM) was trained on protein sequences extracted from the Protein Data
Bank (https://www.rcsb.org/) and evaluated for performance. Machine learning
methods have proven superior for prediction of the secondary structure of proteins
[Rost, 2001]. Alpha-helices and beta-sheets, by far the most common structures, can be
predicted. Predicted structures are automatically added to the query sequence as annotations,
which can later be edited.
In order to predict the secondary structure of proteins:
Tools | Protein Analysis | Predict secondary structure
This opens the dialog displayed in figure 20.21:
Figure 20.21: Choosing one or more protein sequences for secondary structure prediction.
If a sequence was selected before running the tool, that sequence will be listed in the Selected
Elements pane of the dialog. Use the arrows to add or remove sequences or sequence lists from
the selected elements.
You can perform the analysis on several protein sequences at a time. This will add annotations
to all the sequences and open a view for each sequence.
Click Finish to start the tool.
After running the prediction as described above, the protein sequence will show predicted
alpha-helices and beta-sheets as annotations on the original sequence (see figure 20.22).
Each annotation will carry a tooltip note saying that the corresponding annotation is predicted
with CLC Main Workbench. Additional notes can be added through the Edit Annotation
right-click mouse menu. See section 14.3.2.
Undesired alpha-helices or beta-sheets can be removed through the Delete Annotation
right-click mouse menu. See section 14.3.5.
• Protein charge plot. Plot of charge as function of pH, see section 20.1.
When you have selected the relevant analyses, click Next. In the following dialogs, adjust the
parameters for the different analyses you selected. The parameters are explained in more detail
in the relevant chapters or sections (mentioned in the list above).
For sequence statistics:
• Individual Statistics Layout. Comparative is disabled because reports are generated for
one protein at a time.
• Database and search type lets you choose different databases and specify the search for
full domains or fragments.
• Genetic code lets you choose a genetic code for the sequence or the database.
Figure 20.23: A protein report. There is a Table of Contents in the Side Panel that makes it easy to
browse the report.
By double clicking a graph in the output, this graph is shown in a different view (CLC Main
Workbench generates another tab). The report output and the new graph views can be saved by
dragging the tab into the Navigation Area.
The content of the tables in the report can be copied and pasted out of the program, e.g. into
Microsoft Excel. You can also Export the report in Excel format.
If a sequence was selected before running the tool, that sequence will be listed in the Selected
Elements pane of the dialog. Use the arrows to add or remove sequences or sequence lists from
the selected elements. You can translate several protein sequences at a time.
Adjust the parameters for the translation in the dialog shown in figure 20.25.
• Use random codon. This will randomly back-translate an amino acid to a codon, assuming
genetic code 1 (the standard genetic code) but without using the codon frequency tables.
Every time you perform the analysis you will get a different result.
• Use only the most frequent codon. On the basis of the selected translation table, this
option will assign the codon that occurs most often. When choosing this option,
the results of performing several reverse translations will always be the same, contrary to
the other two options.
• Use codon based on frequency distribution. This option is a mix of the other two options.
The selected translation table is used to attach weights to each codon based on its
frequency. The codons are assigned randomly with a probability given by the weights. A
more frequent codon has a higher probability of being selected. Every time you perform
the analysis, you will get a different result. This option yields a result that is closer to the
translation behavior of the organism (assuming you choose an appropriate codon frequency
table).
• Map annotations to reverse translated sequence. If this checkbox is checked, then all
annotations on the protein sequence will be mapped to the resulting DNA sequence. In the
tooltip on the transferred annotations, there is a note saying that the annotation derives
from the original sequence.
The Codon Frequency Table is used to determine the frequencies of the codons. Select a
frequency table from the list that fits the organism you are working with. The codon frequency
table of an organism is created by counting all the codons in its coding sequences. Every codon
in a Codon Frequency Table has its own count, frequency (per thousand) and fraction, which are
calculated from the occurrences of the codon in the organism. The tables provided
were made using the Codon Usage Database (https://www.kazusa.or.jp/codon/), which was
built on the NCBI-GenBank Flat File Release 160.0 [June 15 2007]. You can customize the list
of codon frequency tables for your installation, see Appendix I.
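How the three values in such a table relate to each other can be sketched as follows. This is an illustration only, assuming the standard genetic code and assuming that "fraction" means the share of a codon among all codons for the same amino acid; the function name is hypothetical and the code is not the procedure used to build the tables shipped with the Workbench.

```python
from collections import Counter
from itertools import product

# Standard genetic code (translation table 1) in the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_frequency_table(coding_sequences):
    """Count codons in coding sequences and derive frequency and fraction.

    Returns {codon: {"amino_acid", "count", "per_thousand", "fraction"}}.
    """
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in STANDARD_CODE:      # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    per_amino_acid = Counter()
    for codon, count in counts.items():
        per_amino_acid[STANDARD_CODE[codon]] += count
    return {
        codon: {
            "amino_acid": STANDARD_CODE[codon],
            "count": count,
            "per_thousand": 1000.0 * count / total,
            "fraction": count / per_amino_acid[STANDARD_CODE[codon]],
        }
        for codon, count in counts.items()
    }
```

For the single toy sequence `ATGGCGGCGGCCTAA`, the codon GCG is counted twice out of five codons (400 per thousand) and makes up two thirds of the alanine codons.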
Click Finish to start the tool. The newly created nucleotide sequence is shown, and if the
analysis was performed on several protein sequences, there will be a corresponding number of
views of nucleotide sequences.
The Genetic Code In 1968 the Nobel Prize in Physiology or Medicine was awarded to Robert W.
Holley, Har Gobind Khorana and Marshall W. Nirenberg for their interpretation of the Genetic
Code (https://www.nobelprize.org/prizes/medicine/1968/summary/). The Genetic Code
represents translations of all 64 different codons into 20 different amino acids and stop signals.
Translating a DNA/RNA sequence into a specific protein is therefore unambiguous. But due
to the degeneracy of the genetic code, several codons may code for the same amino
acid. This can be seen in the table below. Since the discovery of the genetic code it has been
found that different organisms (and organelles) have genetic codes which differ from
the "standard genetic code". Moreover, the amino acid alphabet is no longer limited to 20 amino
acids. The 21st amino acid, selenocysteine, is encoded by a 'UGA' codon, which is normally
a stop codon. The discrimination of a selenocysteine from a stop codon is carried out by the
translation machinery. Selenocysteines are very rare amino acids.
The table below shows the Standard Genetic Code which is the default translation table.
TTT F Phe TCT S Ser TAT Y Tyr TGT C Cys
TTC F Phe TCC S Ser TAC Y Tyr TGC C Cys
TTA L Leu TCA S Ser TAA * Ter TGA * Ter
TTG L Leu i TCG S Ser TAG * Ter TGG W Trp
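The lookup that the Standard Genetic Code table describes can be sketched in a few lines of Python. This is only an illustration, not the Workbench's implementation: the names STANDARD_CODE and translate are made up for this example, and the 64-letter amino acid string follows the usual TCAG codon ordering (TTT, TTC, TTA, TTG, TCT, ...), which matches the rows shown above.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for all 64 codons in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Hypothetical helper table: codon -> one-letter amino acid.
STANDARD_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a DNA/RNA sequence in reading frame 1 with the standard code.

    Trailing bases that do not form a complete codon are ignored. Codons
    containing ambiguous symbols are not handled and raise a KeyError.
    """
    dna = sequence.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(STANDARD_CODE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

Because each codon maps to exactly one amino acid, this direction is unambiguous; the reverse direction is not, which is the subject of the next section.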
Solving the ambiguities of reverse translation A particular protein follows from the translation
of a DNA sequence whereas the reverse translation need not have a specific solution according
to the Genetic Code. The Genetic Code is degenerate which means that a particular amino
acid can be translated into more than one codon. Hence there are ambiguities of the reverse
translation.
In order to solve these ambiguities of reverse translation you can define how to prioritize the
codon selection, e.g.:
As an example, we want to translate an alanine to the corresponding codon. Four different codons
can be used for this reverse translation: GCU, GCC, GCA or GCG. Picking any one of them at random
will give an alanine.
The most frequent codon coding for an alanine in E. coli is GCG, encoding 33.7% of all alanines.
Then come GCC (25.5%), GCA (20.3%) and finally GCU (15.3%). The data are retrieved from the
Codon usage database, see below. Always picking the most frequent codon does not necessarily
give the best answer.
By selecting codons from a distribution of calculated codon frequencies, the DNA sequence
obtained after the reverse translation holds the correct (or nearly correct) codon distribution. It
should be kept in mind that the obtained DNA sequence is not necessarily identical to the original
one encoding the protein in the first place, due to the degeneracy of the genetic code.
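The difference between always picking the most frequent codon and drawing codons from the frequency distribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the Workbench's algorithm: the function names are hypothetical, and the table holds only the E. coli alanine figures quoted above, whereas a real codon frequency table would cover every amino acid.

```python
import random

# Fractions for the alanine codons in E. coli as quoted in the text.
# Only alanine ("A") is filled in for this illustration.
CODON_FRACTIONS = {
    "A": {"GCG": 0.337, "GCC": 0.255, "GCA": 0.203, "GCU": 0.153},
}


def most_frequent_codon(amino_acid, table=CODON_FRACTIONS):
    """Strategy 1: always return the codon with the highest fraction."""
    codons = table[amino_acid]
    return max(codons, key=codons.get)


def sample_codon(amino_acid, rng, table=CODON_FRACTIONS):
    """Strategy 2: draw a codon with probability proportional to its fraction."""
    codons = table[amino_acid]
    # random.choices normalizes the weights, so they need not sum to 1.
    return rng.choices(list(codons), weights=list(codons.values()), k=1)[0]


def reverse_translate(protein, rng=None, table=CODON_FRACTIONS):
    """Reverse translate a protein by sampling one codon per residue."""
    rng = rng or random.Random()
    return "".join(sample_codon(residue, rng, table) for residue in protein)


if __name__ == "__main__":
    print(most_frequent_codon("A"))
    print(reverse_translate("AAAA", random.Random(1)))
```

With the first strategy every alanine becomes GCG; with the second, a long sequence approaches the tabulated codon distribution, which is the property described in the paragraph above.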
In order to obtain the best possible result of the reverse translation, one should use the codon
frequency table from the correct organism or a closely related species. The codon usage of the
mitochondrial chromosome is often different from that of the native chromosome(s), thus mitochondrial
codon frequency tables should only be used when working specifically with mitochondria.
Other useful resources
The Genetic Code at NCBI:
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
Codon usage database:
http://www.kazusa.or.jp/codon/
Wikipedia on the genetic code
http://en.wikipedia.org/wiki/Genetic_code
In the second dialog, you can select proteolytic cleavage enzymes. Presently, the list contains
the enzymes shown in figure 20.27. The full list of enzymes and their cleavage patterns can be
seen in Appendix, section B.
You can then set parameters for the detection. This limits the number of detected cleavages
(figure 20.28).
• Min. and max. number of cleavage sites. Certain proteolytic enzymes cleave at many
positions in the amino acid sequence. For instance proteinase K cleaves at nine different
amino acids, regardless of the surrounding residues. Thus, it can be very useful to limit the
number of actual cleavage sites before running the analysis.
• Min. and max. fragment length. Likewise, it is possible to limit the output to only display
sequence fragments within a chosen length range. Both a lower and an upper limit can be chosen.
• Min. and max. fragment mass. The molecular weight is not necessarily directly correlated
to the fragment length as amino acids have different molecular masses. For that reason it
is also possible to limit the search for proteolytic cleavage sites to a mass range.
For example, suppose you have one protein sequence but only want to show which enzymes cut
between two and four times. You should then select "The enzymes has more cleavage sites than
2" and select "The enzyme has less cleavage sites than 4". In the next step you should simply
select all enzymes. This will result in a view where only enzymes which cut 2, 3 or 4 times are
presented.
Click Finish to start the tool. The result of the detection is displayed in figure 20.29.
Depending on the settings in the program, the output of the proteolytic cleavage site detection
will display two views on the screen. The top view shows the actual protein sequence with the
predicted cleavage sites indicated by small arrows. If no labels are found on the arrows they can
be enabled by setting the labels in the "annotation layout" in the preference panel. The bottom
view shows a text output of the detection, listing the individual fragments and information on
these.
• Signal peptides or targeting sequences are removed during translocation through a mem-
brane.
• Viral proteins that were translated from a monocistronic mRNA are cleaved.
Proteolytic cleavage of proteins has shown its importance in laboratory experiments where it is
often useful to work with specific peptide fragments instead of entire proteins.
Proteases also have commercial applications. As an example proteases can be used as
detergents for cleavage of proteinaceous stains in clothing.
The general nomenclature of cleavage site positions of the substrate was formulated by
Schechter and Berger, 1967-68 [Schechter and Berger, 1967], [Schechter and Berger, 1968].
They designate the cleavage site as lying between P1 and P1', incrementing the numbering in the N-terminal
direction of the cleaved peptide bond (P2, P3, P4, etc.). On the carboxyl side of the cleavage
site the numbering is incremented in the same way (P1', P2', P3', etc.). This is visualized in
figure 20.30.
Figure 20.30: Nomenclature of the peptide substrate. The substrate is cleaved between position
P1-P1'.
Proteases often have a specific recognition site where the peptide bond is cleaved. As an
example, trypsin only cleaves at lysine or arginine residues, but it does not matter (with a few
exceptions) which amino acid is located at position P1' (carboxy-terminal of the cleavage site).
Another example is thrombin, which cleaves if an arginine is found in position P1, but not if a D or
E is found in position P1' at the same time (see figure 20.31).
Figure 20.31: Hydrolysis of the peptide bond between two amino acids. Trypsin cleaves unspecifically
at lysine or arginine residues whereas thrombin cleaves at arginines if aspartate or glutamate
is absent.
Bioinformatics approaches are used to identify potential peptidase cleavage sites. Fragments
can be found by scanning the amino acid sequence for patterns which match the corresponding
cleavage site for the protease. When identifying cleaved fragments it is relatively important to
know the calculated molecular weight and the isoelectric point.
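Scanning a sequence for cleavage patterns can be sketched as below. The sketch encodes only the simplified trypsin and thrombin rules stated in the text above (P1 residue, optionally blocked by certain P1' residues); the function names and parameters are hypothetical, and the cleavage patterns actually used by the Workbench are the ones listed in the Appendix.

```python
def cleavage_sites(sequence, p1_residues, blocked_p1_prime=""):
    """Return 1-based positions of P1 residues after which the peptide is cut.

    A bond between P1 and P1' is cut when P1 is one of `p1_residues`
    and P1' is not one of `blocked_p1_prime`.
    """
    sites = []
    for i in range(len(sequence) - 1):
        p1, p1_prime = sequence[i], sequence[i + 1]
        if p1 in p1_residues and p1_prime not in blocked_p1_prime:
            sites.append(i + 1)
    return sites


def fragments(sequence, sites):
    """Split the sequence into the fragments produced by cutting at `sites`."""
    bounds = [0] + list(sites) + [len(sequence)]
    return [sequence[start:end] for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    peptide = "AKRDGRA"
    # Simplified trypsin rule: cut after K or R.
    trypsin = cleavage_sites(peptide, "KR")
    # Simplified thrombin rule: cut after R unless P1' is D or E.
    thrombin = cleavage_sites(peptide, "R", blocked_p1_prime="DE")
    print(trypsin, fragments(peptide, trypsin))
    print(thrombin, fragments(peptide, thrombin))
```

The resulting fragment list is the natural place to apply the minimum and maximum fragment length or mass filters described earlier.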
Other useful resources
The Peptidase Database: https://www.ebi.ac.uk/merops/
Chapter 21
Sequencing data analyses and assembly
Contents
21.1 Importing and viewing trace data
    21.1.1 Trace settings in the Side Panel
21.2 Trim sequences
    21.2.1 Trimming using the Trim Sequences tool
    21.2.2 Manual trimming
21.3 Assemble sequences
21.4 Assemble sequences to reference
21.5 Sort sequences by name
21.6 Add sequences to an existing contig
21.7 View and edit contigs and read mappings
    21.7.1 View settings in the Side Panel
    21.7.2 Editing a contig or read mapping
    21.7.3 Sorting reads
    21.7.4 Read conflicts
    21.7.5 Using the mapping
    21.7.6 Extracting reads from mappings
    21.7.7 Variance table
21.8 Reassemble contig
21.9 Secondary peak calling
21.10 Extract Consensus Sequence
CLC Main Workbench lets you import, trim and assemble DNA sequence reads from automated
sequencing machines. A number of different formats are supported (see section 7.1).
This chapter first explains how to trim sequence reads. Next follows a description of how to
assemble reads into contigs both with and without a reference sequence. In the final section,
the options for viewing and editing contigs are explained.
CHAPTER 21. SEQUENCING DATA ANALYSES AND ASSEMBLY 443
Figure 21.1: A tooltip displaying information about the quality of the chromatogram.
The qualities are based on the Phred scoring system, with scores below 20 counted as low
quality, scores between 20 and 39 counted as medium quality, and those 40 and above counted
as high quality.
If the trace file does not contain information about quality, only the sequence length will be
shown.
To view the trace data, open the sequence read in a standard sequence view ( ).
The traces can be scaled by dragging the trace vertically as shown in figure 21.2. The
Workbench automatically adjusts the height of the traces to be readable, but if the trace height
varies a lot, this manual scaling is very useful.
The height of the area available for showing traces can be adjusted in the Side Panel as described
in section 21.1.1.
• Nucleotide trace. For each of the four nucleotides the trace data can be selected and
unselected.
• Scale traces. A slider which allows the user to scale the height of the trace area. Scaling
the traces individually is described in section 21.1.
Figure 21.3: A sequence with trace data. The preferences for viewing the trace are shown in the
Side Panel.
When working with stand-alone mappings containing reads with trace data, you can view the
traces by turning on the trace setting options as described here and choosing Not compact in
the Read layout setting for the mapping.
Figure 21.4: Trimming creates annotations on the regions that will be ignored in the assembly
process.
• You wish to ensure consistency when trimming. That is, you wish to ensure the same
criteria are used for all the sequences in a set.
To start up the Trim Sequences tool in the Workbench, go to the menu option:
Tools | Sanger Sequencing Analysis ( ) | Trim Sequences ( )
This opens a dialog where you can choose the sequences to trim, by using the arrows to move
them between the Navigation Area and the 'Selected Elements' box.
You can then specify the trim parameters as displayed in figure 21.5.
• Ignore existing trim information. If you have previously trimmed the sequences, you can
check this to remove existing trimming annotation prior to analysis.
• Trim using quality scores. If the sequence files contain quality scores from a base caller
algorithm this information can be used for trimming sequence ends. The program uses the
modified-Mott trimming algorithm for this purpose (Richard Mott, personal communication):
Quality scores in the Workbench are on a Phred scale, and formats using other scales will be
converted during import. The Phred quality score (Q) is defined as $Q = -10 \log_{10}(P)$, where
P is the base-calling error probability. Quality scores can therefore be used to calculate error
probabilities, which in turn can be used to set the limit for which bases should be trimmed.
Hence, the first step in the trim process is to convert the quality score (Q) to an error
probability: $p_{error} = 10^{-Q/10}$. (This now means that low values are high quality bases.)
Next, for every base a new value is calculated: $Limit - p_{error}$. This value will be negative
for low quality bases, where the error probability is high.
For every base, the Workbench calculates the running sum of this value. If the sum drops
below zero, it is set to zero. The part of the sequence not trimmed will be the region
ending at the highest value of the running sum and starting at the last zero value before
this highest score. Everything before and after this region will be trimmed. A read will be
completely removed if the score never makes it above zero.
At https://resources.qiagenbioinformatics.com/testdata/trim.zip you
find an example sequence and an Excel sheet showing the calculations done for this
particular sequence to illustrate the procedure described above.
• Trim ambiguous nucleotides. This option trims the sequence ends based on the presence
of ambiguous nucleotides (typically N). Note that the automated sequencer generating the
data must be set to output ambiguous nucleotides in order for this option to apply. The
algorithm takes as input the maximal number of ambiguous nucleotides allowed in the
sequence after trimming. If this maximum is set to e.g. 3, the algorithm finds the maximum
length region containing 3 or fewer ambiguities and then trims away the ends not included
in this region. The "Trim ambiguous nucleotides" option trims all types of ambiguous
nucleotides (see Appendix F).
• Trim contamination from vectors in UniVec database. If selected, the program will match
the sequence reads against all vectors in the UniVec database and mark sequence ends
with significant matches with a 'Trim' annotation.
The UniVec database build 10.1 is included when you install the CLC Main Workbench. A
list of all the vectors in the database can be found at
https://www.ncbi.nlm.nih.gov/VecScreen/replist.html.
• Trim contamination from sequences. This option lets you use your own vector sequences
that you have imported into the CLC Main Workbench. If selected, Trim using sequences
will be enabled and you can choose one or more sequences.
• Hit limit for vector trimming. When at least one vector trimming parameter is selected, the
strictness for vector contamination trimming can be specified. Since vector contamination
usually occurs at the beginning or end of a sequence, different criteria are applied for
terminal and internal matches. A match is considered terminal if it is located within the
first 25 bases at either sequence end. Three match categories are defined according to
the expected frequency of an alignment with the same score occurring between random
sequences. The CLC Main Workbench uses the same settings as VecScreen
(https://www.ncbi.nlm.nih.gov/tools/vecscreen/):
Weak hit limit Expect 1 random match in 40 queries of length 350 kb.
∗ Terminal match with Score 16 to 18.
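The quality-score trimming procedure described under "Trim using quality scores" above can be sketched as follows. This is a minimal illustration of the running-sum steps as they are worded in this manual, not the Workbench's own code: the function name mott_trim is hypothetical, and the default limit of 0.05 is an assumption chosen only for the example.

```python
def mott_trim(qualities, limit=0.05):
    """Return (start, end) of the region to keep, or None if the read is removed.

    `qualities` are Phred scores, one per base. `start` is inclusive and
    `end` is exclusive, both 0-based. `limit` is the error probability limit
    (the default here is an assumption made for this sketch).
    """
    running = 0.0       # running sum of (limit - p_error)
    best = 0.0          # highest running sum seen so far
    start = 0           # first base after the most recent zero of the sum
    best_start = 0
    best_end = None
    for i, q in enumerate(qualities):
        p_error = 10 ** (-q / 10)       # convert Phred score to error probability
        running += limit - p_error      # negative for low quality bases
        if running <= 0:
            running = 0.0               # the sum is never allowed below zero
            start = i + 1
        elif running > best:
            best = running
            best_start = start
            best_end = i + 1
    if best_end is None:
        return None                     # the score never rose above zero
    return best_start, best_end


if __name__ == "__main__":
    # Two poor bases, three good ones, two poor ones.
    print(mott_trim([5, 5, 30, 30, 30, 5, 5]))
```

For the example read, the three bases with a score of 30 are kept and the low quality ends on both sides fall outside the returned region. A read consisting only of low quality bases returns None, which corresponds to the read being completely removed.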
In the last step of the wizard, you can choose to create a report summarizing how each sequence
has been trimmed. Click Finish to start the trimming process. Views
of each trimmed sequence will be shown, and you can inspect the result by looking at the
"Trim" annotations (they are colored red by default). Note that the trim annotations are used to
signal that this part of the sequence is to be ignored during further analyses, hence the trimmed
sequences are not deleted. If there are no trim annotations, the sequence has not been trimmed.
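The "Trim ambiguous nucleotides" option was described above as finding the maximum length region that contains no more than a given number of ambiguous nucleotides. One way to find such a region is a sliding window, sketched below. The function name and the choice to treat every symbol other than A, C, G and T as ambiguous are assumptions made for this illustration; how the Workbench breaks ties between equally long regions is not stated in this manual, so the sketch simply keeps the first one.

```python
def trim_ambiguous(sequence, max_ambiguous, unambiguous="ACGT"):
    """Return (start, end) of the longest region with at most `max_ambiguous`
    ambiguous symbols. `start` is inclusive, `end` exclusive, both 0-based."""
    seq = sequence.upper()
    best_start, best_end = 0, 0
    left = 0
    count = 0                           # ambiguous symbols inside the window
    for right, base in enumerate(seq):
        if base not in unambiguous:
            count += 1
        while count > max_ambiguous:    # shrink the window from the left
            if seq[left] not in unambiguous:
                count -= 1
            left += 1
        if right + 1 - left > best_end - best_start:
            best_start, best_end = left, right + 1
    return best_start, best_end


if __name__ == "__main__":
    read = "NNACGTNACGTACGNN"
    start, end = trim_ambiguous(read, max_ambiguous=1)
    print(start, end, read[start:end])
```

Everything outside the returned region corresponds to the sequence ends that would receive a trim annotation.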
When the sequences are selected, click Next. This will show the dialog in figure 21.6.
• Minimum aligned read length. The minimum number of nucleotides in a read which must
be successfully aligned to the contig. If this criterion is not met by a read, the read is
excluded from the assembly.
• Alignment stringency. Specifies the stringency (Low, Medium or High) of the scoring
function used by the alignment step in the contig assembly algorithm. A higher stringency
level will tend to produce contigs with fewer ambiguities but will also tend to omit more
sequencing reads and to generate more and shorter contigs.
• Conflicts. If there is a conflict, i.e. a position where there is disagreement about the
residue (A, C, T or G), you can specify how the contig sequence should reflect the conflict:
Vote (A, C, G, T). The conflict will be solved by counting instances of each nucleotide
and then letting the majority decide the nucleotide in the contig. In case of a tie,
A, C, G and T are given priority over one another in that order.
Unknown nucleotide (N). The contig will be assigned an 'N' character in all positions
with conflicts (conflicts are registered already when two nucleotides differ).
Ambiguity nucleotides (R, Y, etc.). The contig will display an ambiguity nucleotide
reflecting the different nucleotides found in the reads (nucleotide ambiguity is regis-
tered already when two nucleotides differ). For an overview of ambiguity codes, see
Appendix F.
Note that conflicts will always be highlighted no matter which of the options you choose.
Furthermore, each conflict will be marked as an annotation on the contig sequence and will be
present if the contig sequence is extracted for further analysis. As a result, the details of any
experimental heterogeneity can be maintained and used when the results of single-sequence
analyses are interpreted. Read more about conflicts in section 21.7.4.
• Create full contigs, including trace data. This will create a contig where all the aligned
reads are displayed below the contig sequence. (You can always extract the contig
sequence without the reads later on.) For more information on how to use the contigs that
are created, see section 21.7.
• Show tabular view of contigs. A contig can be shown in both a graphical and a
tabular view. If you select this option, a tabular view of the contig will also be opened. (Even
if you do not select this option, you can show the tabular view of the contig later on by
clicking Table ( ) at the bottom of the view.) For more information about the tabular view
of contigs, see section 21.7.7.
• Create only consensus sequences. This will not display a contig but will only output the
assembled contig sequences as single nucleotide sequences. If you choose this option it
is not possible to validate the assembly process and edit the contig based on the traces.
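The three conflict options described above (Vote, Unknown nucleotide, Ambiguity nucleotides) can be illustrated for a single alignment column. This is a sketch under stated assumptions rather than the Workbench's implementation: the function name and mode names are hypothetical, the column is assumed to contain only A, C, G and T, and the ambiguity symbols are the standard IUPAC codes.

```python
from collections import Counter

# Standard IUPAC ambiguity codes, keyed by the set of nucleotides they represent.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def resolve_column(bases, mode="vote"):
    """Return the contig symbol for one non-empty column of read nucleotides."""
    counts = Counter(bases)
    if len(counts) == 1:
        return bases[0]                 # all reads agree: no conflict
    if mode == "vote":
        top = max(counts.values())
        # Ties are broken in the order A, C, G, T.
        return next(b for b in "ACGT" if counts.get(b, 0) == top)
    if mode == "unknown":
        return "N"
    if mode == "ambiguity":
        return IUPAC[frozenset(counts)]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("vote", "unknown", "ambiguity"):
        print(mode, resolve_column("AGG", mode), resolve_column("AG", mode))
```

Note that in the "unknown" and "ambiguity" modes a conflict is registered as soon as two nucleotides differ, regardless of how many reads support each one.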
When the assembly process has ended, a number of views will be shown, each containing a
contig of two or more sequences that have been matched. If the number of contigs seems too
high or low, try again with another Alignment stringency setting. Depending on your choices of
output options above, the views will include trace files or only contig sequences. However, the
calculation of the contig is carried out the same way, no matter how the contig is displayed.
See section 21.7 on how to use the resulting contigs.
• Reference sequence. Click the Browse and select element icon ( ) in order to select one
or more sequences to use as reference(s).
• Include reference sequence(s) in contig(s). This will create a contig for each reference with
the corresponding reference sequence at the top and the aligned sequences below. This
option is useful when comparing sequence reads to a closely related reference sequence
e.g. when sequencing for SNP characterization.
Only include part of reference sequence(s) in the contig(s). If the aligned sequences
only cover a small part of a reference sequence, it may not be desirable to include the
whole reference sequence in a contig. When this option is selected, you can specify
the number of residues from reference sequences that should be included on each
side of regions spanned by aligned sequences using the Extra residues field.
Figure 21.7: Parameters for how the reference should be handled when assembling sequences to
a reference sequence.
• Do not include reference sequence(s) in contig(s). This will produce contigs without
any reference sequence where the input sequences have been assembled using reference
sequences as a scaffold. The input sequences are first aligned to the reference sequence(s).
Next, the consensus sequences for regions spanned by aligned sequences are extracted
and output as contigs. This option is useful when assembling sequences where
the reference sequences are not closely related to the input sequences.
When the reference sequence has been selected, click Next to see the dialog shown in
figure 21.8.
Figure 21.8: Options for how the input sequences should be aligned and how nucleotide conflicts
should be handled.
• Minimum aligned read length. The minimum number of nucleotides in a read which must
match a reference sequence. If an input sequence does not meet this criterion, the sequence
is excluded from the assembly.
• Alignment stringency. Specifies the stringency (Low, Medium or High) of the scoring
function used for aligning the input sequences to the reference sequence(s). A higher
stringency level often produces contigs with lower levels of ambiguity but also reduces the
ability to align distant homologs or sequences with a high error rate to reference sequences.
The result of a higher stringency level is often that the number of contigs increases and the
average length of contigs decreases while the quality of each contig increases.
The stringency settings Low, Medium and High are based on the following score values
(mt=match, ti=transition, tv=transversion, un=unknown):
Score values
                     Low   Medium   High
Match (mt)             2        2      2
Transversion (tv)     -6      -10    -20
Transition (ti)       -2       -6    -16
Unknown (un)          -2       -6    -16
Gap                   -8      -16    -36
Score Matrix
     A    C    G    T    N
A    mt   tv   ti   tv   un
C    tv   mt   tv   ti   un
G    ti   tv   mt   tv   un
T    tv   ti   tv   mt   un
N    un   un   un   un   un
Unknown nucleotide (N). The contig will be assigned an 'N' character in all positions
with conflicts (conflicts are registered already when two nucleotides differ).
Ambiguity nucleotides (R, Y, etc.). The contig will display an ambiguity nucleotide
reflecting the different nucleotides found in the aligned sequences (nucleotide ambiguity
is registered when two nucleotides differ). For an overview of ambiguity codes,
see Appendix F.
Vote (A, C, G, T). The conflict will be solved by counting instances of each nucleotide
and then letting the majority decide the nucleotide in the contig. In case of a tie,
A, C, G and T are given priority over one another in that order.
Note that conflicts will be highlighted for all options. Furthermore, conflicts will be marked
with an annotation on each contig sequence, which is preserved if the contig sequence
is extracted for further analysis. As a result, the details of any experimental heterogeneity
can be maintained and used when the results of single-sequence analyses are interpreted.
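The score values and score matrix given for the Alignment stringency setting can be combined into a small scoring function, sketched below. The values are copied from the tables above; everything else is an assumption made for illustration: the function names are hypothetical, the input is an already aligned pair of equal-length strings with "-" marking gaps, and the gap score is applied once per gap position, since the manual does not state how gaps of different lengths are scored.

```python
# Score values per stringency level, taken from the table above.
SCORES = {
    "Low":    {"mt": 2, "tv": -6,  "ti": -2,  "un": -2,  "gap": -8},
    "Medium": {"mt": 2, "tv": -10, "ti": -6,  "un": -6,  "gap": -16},
    "High":   {"mt": 2, "tv": -20, "ti": -16, "un": -16, "gap": -36},
}

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pair_class(a, b):
    """Classify a nucleotide pair as in the score matrix: mt, ti, tv or un."""
    if a == "N" or b == "N":
        return "un"
    if a == b:
        return "mt"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "ti"                     # A<->G or C<->T
    return "tv"


def score_alignment(read, reference, stringency="Medium"):
    """Sum the per-position scores of two aligned, equal-length sequences."""
    if len(read) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    table = SCORES[stringency]
    total = 0
    for a, b in zip(read.upper(), reference.upper()):
        if a == "-" or b == "-":
            total += table["gap"]       # assumption: one penalty per gap position
        else:
            total += table[pair_class(a, b)]
    return total


if __name__ == "__main__":
    for level in ("Low", "Medium", "High"):
        print(level, score_alignment("ACGTTGCA", "ACGTCGCA", level))
```

Scoring the same alignment at all three levels shows why a higher stringency excludes more reads: the match reward stays at 2 while every kind of mismatch becomes more costly.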
Click Finish to start the tool. This will start the assembly process. See section 21.7 on how to
use the resulting contigs.
...
A02__Asp_F_016_2007-01-10
A02__Asp_R_016_2007-01-10
A02__Gln_F_016_2007-01-11
A02__Gln_R_016_2007-01-11
A03__Asp_F_031_2007-01-10
A03__Asp_R_031_2007-01-10
A03__Gln_F_031_2007-01-11
A03__Gln_R_031_2007-01-11
...
In this example, the names have five distinct parts (we take the first name as an example):
To start mapping these data, you probably want to have them divided into groups instead of
having all reads in one folder. If, for example, you wish to map each sample separately, or if you
wish to map each gene separately, you cannot simply run the mapping on all the sequences in
one step.
That is where Sort Sequences by Name comes into play. It will allow you to specify which part
of the name should be used to divide the sequences into groups. We will use the example
described above to show how it works:
Tools | Molecular Biology Tools ( ) | Sanger Sequencing Analysis ( ) | Sort
Sequences by Name ( )
This opens a dialog where you can add the sequences you wish to sort, by using the arrows to
move them between the Navigation Area and 'Selected Elements'. You can also add sequence
lists or the contents of an entire folder by right-clicking the folder and choosing Add folder
contents.
When you click Next, you will be able to specify the details of how the grouping should be
performed. First, you have to choose how each part of the name should be identified. There are
three options:
• Simple. This will simply use a designated character to split up the name. You can choose
a character from the list:
Underscore _
Dash -
Hash (number sign / pound sign) #
Pipe |
Tilde ~
Dot .
• Positions. You can define a part of the name by entering the start and end positions, e.g.
from character number 6 to 14. For this to work, the names have to be of equal lengths.
• Java regular expression. This is an option for advanced users where you can use a special
syntax to have total control over the splitting. See more below.
In the example above, it would be sufficient to use a simple split with the underscore _ character,
since this is how the different parts of the name are divided.
When you have chosen a way to divide the name, the parts of the name will be listed in the table
at the bottom of the dialog. There is a checkbox next to each part of the name. This checkbox is
used to specify which of the name parts should be used for grouping. In the example above, if
we want to group the reads according to date and analysis position, these two parts should be
checked as shown in figure 21.9.
Figure 21.9: Splitting up the name at every underscore (_) and using the date and analysis position
for grouping.
• Sequence name. This is the name of the first sequence that has been chosen. It is shown
here in the dialog in order to give you a sample of what the names in the list look like.
• Resulting group. The name of the group that this sequence would belong to if you proceed
with the current settings.
• Number of groups. The number of groups that would be produced when you proceed with
the current settings.
This preview cannot be changed. It is shown to guide you when finding the appropriate settings.
Click Finish to start the tool. A new sequence list will be generated for each group. It will be
named according to the group, e.g. 2004-08-24_A02 will be the name of one of the groups in the
example shown in figure 21.9.
...
adk-29_adk1n-F
adk-29_adk2n-R
adk-3_adk1n-F
adk-3_adk2n-R
adk-66_adk1n-F
adk-66_adk2n-R
atp-29_atpA1n-F
atp-29_atpA2n-R
atp-3_atpA1n-F
atp-3_atpA2n-R
atp-66_atpA1n-F
atp-66_atpA2n-R
...
In this example, we wish to group the sequences into three groups based on the number after the
"-" and before the "_" (i.e. 29, 3 and 66). The simple splitting as shown in figure 21.9 requires
the same character before and after the text used for grouping, and since we now have both a "-"
and a "_", we need to use the regular expressions instead (note that dividing by position would
not work because we have both single and double digit numbers (3, 29 and 66)).
The regular expression for doing this would be (.*)-(.*)_(.*) as shown in figure 21.10.
The round brackets () denote the part of the name that will be listed in the groups table at the
bottom of the dialog. In this example we actually did not need the first and last set of brackets,
so the expression could also have been .*-(.*)_.* in which case only one group would be
listed in the table at the bottom of the dialog.
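The effect of the two expressions can be checked outside the Workbench. The sketch below uses Python's re module rather than Java; for simple patterns like these the two syntaxes behave the same. The helper name group_by_regex is hypothetical, and requiring the expression to match the whole name is an assumption about how the tool applies it.

```python
import re
from collections import defaultdict

NAMES = [
    "adk-29_adk1n-F", "adk-29_adk2n-R", "adk-3_adk1n-F", "adk-3_adk2n-R",
    "adk-66_adk1n-F", "adk-66_adk2n-R", "atp-29_atpA1n-F", "atp-29_atpA2n-R",
    "atp-3_atpA1n-F", "atp-3_atpA2n-R", "atp-66_atpA1n-F", "atp-66_atpA2n-R",
]


def group_by_regex(names, pattern, group_index=1):
    """Group names by the text captured in bracket number `group_index`."""
    compiled = re.compile(pattern)
    groups = defaultdict(list)
    for name in names:
        match = compiled.fullmatch(name)    # the whole name must match
        if match is not None:
            groups[match.group(group_index)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Three sets of brackets: the number is captured by the second one.
    by_three = group_by_regex(NAMES, r"(.*)-(.*)_(.*)", group_index=2)
    # One set of brackets: the number is the only captured part.
    by_one = group_by_regex(NAMES, r".*-(.*)_.*", group_index=1)
    print(sorted(by_three))
    print(by_three == by_one)
```

Both expressions place the twelve example names into the same three groups (29, 3 and 66), with four sequences in each.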
Figure 21.10: Dividing the sequence into three groups based on the number in the middle of the
name.
Note that the new sequences will be added to the existing contig which will not be extended. If
the new sequences extend beyond the existing contig, they will be cut off.
Figure 21.12: The view of a contig. Controls at the bottom allow you to zoom in and out, and
settings to the right control how the mapping is displayed.
the final contig or mapping results. This may be due to trimming before or during the assembly
or to misalignment with other reads (assembly) or the reference sequence (mapping).
Simply drag the edge of the faded section to adjust the trimmed area to include more of the read
in the contig or mapping (figure 21.13).
Figure 21.13: Drag the edge of the faded area to customize how much of a read should be
considered in the mapping.
Note: Handles for dragging are only available when individual residues can be seen. For this,
zoom in fully and choose a Compactness level of "Not compact", "Low" or "Packed".
To reverse complement an entire contig or mapping, right-click in the empty white area of the
contig or mapping and choose to Reverse Complement Sequence.
Read layout.
• Compactness. Set the level of detail to be displayed. The level of compactness affects
other view settings as well as the overall view. For example: if Compact is selected,
quality scores and annotations on the reads will not be visible, even if these options
are turned on under the "Nucleotide info" palette. Compactness can also be changed
by pressing and holding the Alt key while scrolling with the mouse wheel or touchpad.
Not compact. This allows the mapping to be viewed in full detail, including quality
scores and trace data for the reads, where present. To view such information,
additional viewing options under the Nucleotide info view settings must also be
selected. For further details on these, see section 21.1.1 and section 14.2.1.
Low. Hides trace data and quality scores, and puts the reads' annotations on the
sequence. The editing functions available when right-clicking on a nucleotide with
compactness set to Low are shown in figure 21.15.
Medium. The labels of the reads and their annotations are hidden, and reads are
shown as lines. The residues of the reads cannot be seen, even when zoomed in
100%.
Compact. Like Medium but with less space between the reads.
Figure 21.14: Settings in the side panel allow customization of the view of read mappings and
contigs from assemblies.
Packed. This uses all the horizontal space available for displaying the reads
(figure 21.16). This differs from the other settings, which stack all reads vertically.
When zoomed in to 100%, the individual residues are visible. When zoomed
out, reads are represented as lines. Packed mode is useful when viewing large
amounts of data, but some functionality is not available. For example, the read
mapping cannot be edited, portions cannot be selected, and color coding changes
are not possible.
• Gather sequences at top. When selected, the sequence reads contributing to the
mapping at that position are placed right below the reference. This setting has no
effect when the compactness level is Packed.
• Show sequence ends. When selected, trimmed regions are shown (faded traces and
Figure 21.16: An example of the Packed compactness setting. Highlighted in black is an example
of 3 narrow vertical lines representing mismatching residues.
Residue coloring. There is one parameter in this section in addition to those described in
section 14.2.1.
• Sequence colors. This setting controls the coloring of sequences when working in
most compactness modes. The exception is Packed mode, where colors are controlled
with settings under the "Match coloring" tab, described below.
Main. The color of the consensus and reference sequence. Black by default.
Forward. The color of forward reads. Green by default.
Reverse. The color of reverse reads. Red by default.
Paired. The color of read pairs. Blue by default. Reads from broken pairs are
colored according to their orientation (forward or reverse) or as a non-specific
match, but with a darker hue than the color of ordinary reads.
Non-specific matches. When a read would have matched equally well another
place in the mapping, it is considered a non-specific match and is colored yellow
by default. Coloring to indicate a non-specific match overrules other coloring. For
mappings with several reference sequences, a read is considered a non-specific
match if it matches more than once across all the contigs/references.
Colors can be adjusted by clicking on an individual color and selecting from the palette
presented.
Alignment info. There are several parameters in this section in addition to the ones described
in section 16.2.
• Coverage: Shows how many reads are contributing information to a given position in
the read mapping. The level of coverage is relative to the overall number of reads.
• Paired distance: Plots the distance between the members of paired reads.
• Single paired reads: Plots the percentage of reads marked as single paired reads
(when only one of the reads in a pair matches).
• Non-specific matches: Plots the percentage of reads that also match other places.
• Non-perfect matches: Plots the percentage of reads that do not match perfectly.
• Spliced matches: Plots the percentage of reads that are spliced.
• Foreground color. Colors the residues using a gradient, where the left side color is
used for low coverage and the right side is used for maximum coverage.
• Background color. Colors the background of the residues using a gradient, where
the left side color is used for low coverage and the right side is used for maximum
coverage.
• Graph. Read coverage is displayed as a graph. The data points for the graph can be
exported (see section 8.3).
Height. Specifies the height of the graph.
Type. The graph can be displayed as Line plot, Bar plot or as a Color bar.
Color box. For Line and Bar plots, the color of the plot can be set by clicking the
color box. If a Color bar is chosen, the color box is replaced by a gradient color
box as described under Foreground color.
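As an illustration of what the Coverage graph represents, the following minimal sketch counts how many reads cover each reference position. It is not how CLC Main Workbench computes coverage internally. The function name `coverage_per_position` and the example reads are hypothetical, and reads are assumed to be given as 0-based (start, end) intervals with an exclusive end.

```python
def coverage_per_position(reads, reference_length):
    """Count how many reads cover each reference position.

    reads: iterable of (start, end) tuples, 0-based, end exclusive
    (an assumption made for this sketch).
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (reference_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, reference_length)
        if start < end:
            delta[start] += 1
            delta[end] -= 1

    # A running sum of the difference array gives the per-position coverage.
    coverage = []
    running = 0
    for position in range(reference_length):
        running += delta[position]
        coverage.append(running)
    return coverage


example_reads = [(0, 5), (2, 8), (4, 10)]  # hypothetical read intervals
example_coverage = coverage_per_position(example_reads, 10)
print(example_coverage)
```

The resulting list has one value per reference position, which is the kind of data a line plot, bar plot or color bar could be drawn from.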
Match coloring. Coloring of the mapped reads when the Packed compactness option is selected.
Colors can be adjusted by clicking on an individual color and selecting from the palette
presented. Coloring of bases when other compactness settings are selected is controlled
under the "Residue coloring" tab.
In the contig or mapping view, you can use Zoom in ( ) to zoom to a greater level of detail than
in other views (see figure 21.12).
Note: For contigs or mappings with more than 1,000 reads, you can only do single-residue
replacements. When the compactness is Packed, you cannot edit any of the reads.
All changes are recorded in the history of the element (see section 2.5).
• Sort Reads by Alignment Start Position. Reads are listed in order of their start position in
the alignment, with the first read at the top.
• Sort Reads by Length. The shortest reads will be listed at the top.
• Conflict. Both the annotation and the corresponding row in the Table ( ) are colored red.
• Resolved. Both the annotation and the corresponding row in the Table ( ) are colored
green.
The conflict can be resolved by correcting the deviating residues in the reads as described above.
A fast way of making all the reads reflect the consensus sequence is to select the position in
the consensus, right-click the selection, and choose Transfer Selection to All Reads.
The opposite is also possible: make a selection on one of the reads, right-click, and choose
Transfer Selection to Contig Sequence.
• Extract from Selection. Available from the right-click menu of the reference sequence or
consensus sequence (figure 21.17). A new stand-alone read mapping consisting of just
the reads that are completely covered by the selected region will be created. Options are
available to specify the nature of the extracted reads in the 'Specify reads to be included'
wizard step, see below.
• Extract Sequences. Available from the right-click menu of the coverage graph or a read
(figure 21.18), or from the Tools menu. It extracts all reads to a sequence list or individual
sequences. See section 18.2.
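The rule used by Extract from Selection, that only reads completely covered by the selected region are kept, can be sketched as follows. This is a minimal illustration, not the tool's implementation. The `MappedRead` class, its fields and the example reads are hypothetical, and positions are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass
class MappedRead:
    name: str
    start: int  # first reference position covered (1-based, inclusive; assumed)
    end: int    # last reference position covered (1-based, inclusive; assumed)


def reads_within_selection(reads, selection_start, selection_end):
    """Keep only reads that lie entirely inside the selected region."""
    return [
        read for read in reads
        if read.start >= selection_start and read.end <= selection_end
    ]


reads = [
    MappedRead("read_a", 10, 50),   # starts before the selection
    MappedRead("read_b", 40, 120),  # ends after the selection
    MappedRead("read_c", 60, 100),  # completely covered
]
selected = reads_within_selection(reads, 30, 110)
print([read.name for read in selected])
```

Reads that only partly overlap the selection, such as `read_a` and `read_b` here, are left out.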
Figure 21.17: Right-click on the selected region on the reference sequence (left) or consensus
sequence (right) in a stand-alone read mapping for revealing the available options.
The 'Specify reads to be included' wizard step of Extract from Selection offers the following
options (figure 21.19):
Match specificity
• Include specific matches. Reads that mapped best to just a single position of the
reference genome.
• Include non-specific matches. Reads that have multiple, equally good alignments to
the reference genome. These reads are colored yellow by default in read mappings.
Alignment quality
• Include perfectly aligned reads. Reads where the full read is perfectly aligned to
the reference genome. Reads that extend beyond the end of the reference are not
considered perfectly aligned, because part of the read does not match the reference.
• Include reads with less than perfect alignment. Reads with mismatches, insertions or
deletions, or with unaligned ends.
Spliced status
Figure 21.18: Right-click on the coverage graph or reads for revealing the available options.
Paired status
• Include intact paired reads. Paired reads mapped within the specified paired distance.
• Include reads from broken pairs. Paired reads where only one of the reads mapped,
either because only one read in the pair matched the reference, or because the
distance or relative orientation of its mate was wrong.
• Include single reads. Reads marked as single reads (as opposed to paired reads).
Reads from broken pairs are not included. Reads marked as single reads after
trimming paired sequence lists are included.
• Only include matching read(s) of read pairs. If only one read of a read pair matches
the criteria, then only include the matching read as a broken pair. For example, if
only one of the reads from the pair is inside the overlap region, then this option only
includes the read found within the overlap region as a broken read. When both reads
are inside the overlap region, the full paired read is included. Note that some tools
ignore broken reads by default.
Orientation
• Reference position. The position of the conflict measured from the starting point of the
reference sequence.
• Consensus position. The position of the conflict measured from the starting point of the
consensus sequence.
• Consensus residue. The consensus's residue at this position. The residue can be edited
in the graphical view, as described above.
• Other residues. Lists the residues of the reads. Inside the brackets, you can see the
number of reads having this residue at this position. In the example in figure 21.20, you
can see that at position 637 there is a 'C' in the top read in the graphical view. The other
two reads have a 'T'. Therefore, the table displays the following text: 'C (1), T (2)'.
Figure 21.20: The graphical view is displayed at the top, and underneath the conflicts are shown
in a table. At the conflict at position 313, the user has entered a comment in the table (to see it,
make sure the Notes column is wide enough to display all text lines). This comment is now also
added to the tooltip of the conflict annotation in the graphical view above.
• IUPAC. The ambiguity code for this position. The ambiguity code reflects the residues in
the reads - not in the consensus sequence. (The IUPAC codes can be found in section F.)
Conflict. Initially, all the rows in the table have this status. This means that there is
one or more differences between the sequences at this position.
Resolved. If you edit the sequences, e.g. if there was an error in one of the sequences,
and they now all have the same residue at this position, the status is set to Resolved.
• Note. Can be used for your own comments on this conflict. Right-click in this cell of the
table to add or edit the comments. The comments in the table are associated with the
conflict annotation in the graphical view. Therefore, the comments you enter in the table
will also be attached to the annotation on the consensus sequence (the comments can be
displayed by placing the mouse cursor on the annotation for one second - see figure 21.20).
The comments are saved when you Save ( ).
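To make the Other residues and IUPAC columns concrete, the sketch below derives both values from the residues observed in the reads at one position. It is only an illustration under simple assumptions: the function name `summarize_conflict` is hypothetical, gaps are not handled, and only the standard IUPAC nucleotide ambiguity codes are used.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC_CODES = {
    frozenset(bases): code
    for bases, code in [
        ("A", "A"), ("C", "C"), ("G", "G"), ("T", "T"),
        ("AG", "R"), ("CT", "Y"), ("CG", "S"), ("AT", "W"),
        ("GT", "K"), ("AC", "M"), ("CGT", "B"), ("AGT", "D"),
        ("ACT", "H"), ("ACG", "V"), ("ACGT", "N"),
    ]
}


def summarize_conflict(read_residues):
    """Return (other_residues_text, iupac_code) for one conflict position."""
    counts = Counter(read_residues)
    # Text in the style 'C (1), T (2)': each residue followed by its read count.
    text = ", ".join(f"{base} ({count})" for base, count in sorted(counts.items()))
    # The ambiguity code reflects the residues in the reads, not the consensus.
    code = IUPAC_CODES[frozenset(counts)]
    return text, code


# One read has a 'C' and two reads have a 'T', as in the example for position 637.
other_residues, iupac = summarize_conflict(["C", "T", "T"])
print(other_residues, iupac)
```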
By clicking a row in the table, the corresponding position is highlighted in the graphical view.
Clicking the rows of the table is another way of navigating the contig or the mapping, as are using
the Find Conflict button or using the Space bar. You can use the up and down arrow keys to
navigate the rows of the table.
• De novo assembly. This will perform a normal assembly in the same way as if you had
selected the reads as individual sequences. When you click Next, you will follow the same
steps as described in section 21.3. The consensus sequence of the contig will be ignored.
• Reference assembly. This will use the consensus sequence of the contig as reference.
When you click Next, you will follow the same steps as described in section 21.4.
When you click Finish, a new contig is created, so you do not lose the information in the old
contig.
• Fraction of max peak height for calling. Adjust this value to specify how high the secondary
peak must be to be called.
• Peak slope stringency. Control how pronounced each nucleotide peak must be. Decreasing
this will detect more peaks. Increasing it will detect fewer.
• Use IUPAC code / N for ambiguous nucleotides. When a secondary peak is called, the
residue at this position can be replaced either by an N or by an ambiguity character based
on the IUPAC codes (see section F).
Clicking Next allows you to add annotations. In addition to changing the actual sequence,
annotations can be added for each base that has been called. The annotations hold information
about the fraction of the max peak height.
Click Finish to start the tool. This will start the secondary peak calling. A detailed history entry
will be added to the history specifying all the changes made to the sequence.
Secondary peaks are marked in the output sequence as can be seen in figure 21.23. When
the mouse is hovered over a secondary peak, Before and Peak ratio values are shown. The
Before value refers to the original residue that was present in the sequence, while the Peak ratio
shows the ratio between the original peak and the secondary peak signal strength values (the
base associated with the secondary peak is shown in parentheses next to the peak ratio). In the
case of figure 21.23, it can be seen that the original residue is G while the residue C yields a
secondary peak. This then results in the ambiguity character S shown in the sequence.
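The effect of the Fraction of max peak height for calling option can be sketched as follows. This is a simplified illustration and not the tool's algorithm: peak slope stringency is ignored, the function name `call_secondary_peak` and the signal heights are hypothetical, and a secondary peak is assumed to be called when its height is at least the given fraction of the highest peak.

```python
# IUPAC ambiguity codes for pairs of bases.
TWO_BASE_IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_secondary_peak(heights, fraction, use_iupac=True):
    """Return the residue to report at one trace position.

    heights: dict mapping each base to its signal height at this position.
    fraction: minimum secondary/primary height ratio for a call (assumed rule).
    """
    ranked = sorted(heights.items(), key=lambda item: item[1], reverse=True)
    (primary, primary_height), (secondary, secondary_height) = ranked[0], ranked[1]
    # No call: keep the original (highest-peak) residue.
    if primary_height <= 0 or secondary_height < fraction * primary_height:
        return primary
    if not use_iupac:
        return "N"
    return TWO_BASE_IUPAC[frozenset((primary, secondary))]


# Hypothetical heights where G is the original residue and C gives a secondary peak.
heights = {"A": 20, "C": 450, "G": 1000, "T": 15}
called = call_secondary_peak(heights, fraction=0.3)
print(called)
```

With these example heights, raising the fraction to 0.5 leaves the residue as G, because the C peak is then too low to be called.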
• For stand-alone read mappings: the reference sequence or a selection within it, or the
consensus sequence.
• Threshold. Coverage above this value is considered high, while coverage at or below it is
considered low. Overlapping paired-end reads count as two when calculating the coverage.
A higher threshold yields a more reliable consensus but reduces completeness.
• Low coverage handling. Options for handling regions with low coverage.
• Conflict resolution. Options for resolving read disagreements at individual positions within
high coverage regions.
Vote. The consensus is set to the most supported symbol (see Use quality score),
excluding ambiguous symbols.
In case of a tie, symbols are chosen in the order: A > C > G > T.
To preserve biological heterozygous variation, see Insert ambiguity codes.
Insert ambiguity codes. The consensus is set to the IUPAC ambiguity code (section F)
that best reflects the variation observed in the reads.
The following options determine which symbols contribute to the ambiguity codes:
∗ Noise threshold. Only symbols with support greater than this value (see Use quality
score) contribute to the ambiguity code.
∗ Minimum nucleotide count. Only symbols present in at least this number of reads
contribute to the ambiguity code.
Positions where no symbol qualifies to contribute to the ambiguity code are not
included in the consensus.
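The two conflict resolution strategies can be sketched for a single position as follows. This is a minimal illustration, not the tool's implementation. The function names are hypothetical, quality scores are not used, and "support" is assumed here to be the fraction of reads showing a symbol.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC_CODES = {
    frozenset(bases): code
    for bases, code in [
        ("A", "A"), ("C", "C"), ("G", "G"), ("T", "T"),
        ("AG", "R"), ("CT", "Y"), ("CG", "S"), ("AT", "W"),
        ("GT", "K"), ("AC", "M"), ("CGT", "B"), ("AGT", "D"),
        ("ACT", "H"), ("ACG", "V"), ("ACGT", "N"),
    ]
}


def vote(symbols):
    """Most supported unambiguous symbol; ties resolved in the order A > C > G > T."""
    counts = Counter(s for s in symbols if s in UNAMBIGUOUS)
    if not counts:
        return None
    # max() returns the first maximal item, so iterating "ACGT" gives the tie order.
    return max("ACGT", key=lambda base: counts[base])


def ambiguity_code(symbols, noise_threshold, min_count):
    """IUPAC code for the symbols passing both filters, or None if none qualify."""
    counts = Counter(s for s in symbols if s in UNAMBIGUOUS)
    total = sum(counts.values())
    kept = {
        base for base, count in counts.items()
        if count >= min_count and count / total > noise_threshold
    }
    # None means the position is left out of the consensus.
    return IUPAC_CODES[frozenset(kept)] if kept else None


column = list("CCCCTTTTA")  # hypothetical residues from nine reads at one position
print(vote(column), ambiguity_code(column, noise_threshold=0.2, min_count=2))
```

In this example the single A is filtered out by the minimum nucleotide count, so the ambiguity code reflects only C and T.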
The following options for annotating the consensus sequence can be configured (figure 21.25):
Figure 21.25: Options for adding annotations to the extracted consensus sequence.
• Add consensus annotations (conflicts, indels, low coverage etc.). When checked, annotations
are added to the consensus sequence to indicate resolved conflicts, deletions relative
to the reference, and low coverage regions, provided the Split into separate sequences
option is not selected.
For inputs containing many reads or long references, many such annotations may be
generated.
• Keep annotations already on consensus and Transfer annotations from reference. When
checked, annotations present on the consensus or on the reference in the input stand-alone
read mapping are copied to the extracted consensus sequence. The copied annotations are
placed in regions corresponding to their original location in the input data, although actual
coordinates may differ. Annotations may be split if the Split into separate sequences
option is selected.
These options are not enabled for types of input other than stand-alone read mapping.
Contents
22.1 Primer design - an introduction
22.1.1 General concept
22.1.2 Scoring primers
22.2 Setting parameters for primers and probes
22.2.1 Primer Parameters
22.3 Graphical display of primer information
22.3.1 Compact information mode
22.3.2 Detailed information mode
22.4 Output from primer design
22.5 Standard PCR
22.5.1 When a single primer region is defined
22.5.2 When both forward and reverse regions are defined
22.5.3 Standard PCR output table
22.6 Nested PCR
22.7 TaqMan
22.8 Sequencing primers
22.9 Alignment-based primer and probe design
22.9.1 Specific options for alignment-based primer and probe design
22.9.2 Alignment based design of PCR primers
22.9.3 Alignment-based TaqMan probe design
22.10 Analyze primer properties
22.11 Find binding sites and create fragments
22.11.1 Binding parameters
22.11.2 Results - binding sites and fragments
22.12 Order primers
CLC Main Workbench offers graphically and algorithmically advanced design of primers and probes
for various purposes. This chapter begins with a brief introduction to the general concepts of the
primer design process. Then follow instructions on how to adjust parameters for primers,
how to inspect and interpret primer properties graphically and how to interpret, save and analyze
CHAPTER 22. PRIMERS AND PROBES 474
the output of the primer design analysis. After a description of the different reaction types for
which primers can be designed, the chapter closes with sections on how to match primers with
other sequences and how to create a primer order.
Figure 22.1: The initial view of the sequence used for primer design.
Figure 22.2: Right-click menu allowing you to specify regions for the primer design
representing the primer of interest. A tool-tip will then appear on screen, displaying detailed
information about the primer in relation to the set criteria. To locate the primer on the sequence,
simply left-click the circle using the mouse.
The various primer parameters can now be varied to explore their effect and the view area will
dynamically update to reflect this allowing for a high degree of interactivity in the primer design
process.
After exploring the potential primers, the user may have found a satisfactory primer and can
choose to export it directly from the view area by right-clicking the primer's information
point. This approach does not take into account the properties of primer/probe pairs or sets,
e.g. primer pair annealing and Tm difference between primers. If such information is desired,
the user can use the Calculate button at the bottom of the Primer parameters preference group.
This opens a dialog, the contents of which depend on the chosen mode. Here, the user can set
primer-pair specific settings, such as the allowed or desired Tm difference, and view the
single-primer parameters that were chosen in the Primer parameters preference group.
Upon clicking Finish, an algorithm will generate all possible primer sets and rank these based
on their characteristics and the chosen parameters. A list will appear displaying the 100
highest-scoring sets and information pertaining to these. The search result can be saved to the
navigator. From the result table, suggested primers or primer/probe sets can be explored since
clicking an entry in the table will highlight the associated primers and probes on the sequence.
It is also possible to save individual primers or sets from the table through the mouse right-click
menu. For a given primer pair, the amplified PCR fragment can also be opened or saved using
the mouse right-click menu.
Figure 22.3: The two groups of primer parameters (in the program, the Primer information group is
listed below the other group).
• Length. Determines the length interval within which primers can be designed by setting a
maximum and a minimum length. The upper and lower lengths allowed by the program are
50 and 10 nucleotides respectively.
• Melting temperature. Determines the temperature interval within which primers must lie.
When the Nested PCR or TaqMan reaction type is chosen, the first pair of melting temperature
interval settings relates to the outer primer pair, i.e. not the probe. Melting temperatures
are calculated by a nearest-neighbor model that considers stacking interactions between
neighboring bases in the primer-template complex. The model uses thermodynamic parameters
from [SantaLucia, 1998] and considers the important contribution from the dangling ends
that are present when a short primer anneals to a template sequence [Bommarito et al., 2000].
A number of reaction mixture parameters that influence melting temperatures can be adjusted
(see below). Melting temperatures are corrected for the presence of monovalent cations using
the model of [SantaLucia, 1998], and are further corrected for the presence of magnesium,
deoxynucleotide triphosphates (dNTP) and dimethyl sulfoxide (DMSO) using the model of
[von Ahsen et al., 2001].
• Inner melting temperature. This option is only activated when the Nested PCR or TaqMan
mode is selected. In Nested PCR mode, it determines the allowed melting temperature
interval for the inner/nested pair of primers, and in TaqMan mode it determines the allowed
temperature interval for the TaqMan probe.
Secondary structure. Determines the maximum score of the optimal secondary DNA
structure found for a primer or probe. Secondary structures are scored by the number
of hydrogen bonds in the structure, and 2 extra hydrogen bonds are added for each
stacking base-pair in the structure.
• 3' end G/C restrictions. When this checkbox is selected it is possible to specify restrictions
concerning the number of G and C nucleotides in the 3' end of primers and probes. A low
G/C content of the primer/probe 3' end increases the specificity of the reaction. A high
G/C content facilitates a tight binding of the oligo to the template but also increases the
possibility of mispriming. Unfolding the preference groups yields the following options:
End length. The number of consecutive terminal nucleotides for which to consider the
G/C content.
Max no. of G/C. The maximum number of G and C nucleotides allowed within the
specified length interval.
Min no. of G/C. The minimum number of G and C nucleotides required within the
specified length interval.
• 5' end G/C restrictions. When this checkbox is selected it is possible to specify restrictions
concerning the number of G and C nucleotides in the 5' end of primers and probes. A high
G/C content facilitates a tight binding of the oligo to the template but also increases the
possibility of mispriming. Unfolding the preference groups yields the same options as
described above for the 3' end.
• Mode. Specifies the reaction type for which primers are designed:
Standard PCR. Used when the objective is to design primers, or primer pairs, for PCR
amplification of a single DNA fragment.
Nested PCR. Used when the objective is to design two primer pairs for nested PCR
amplification of a single DNA fragment.
Sequencing. Used when the objective is to design primers for DNA sequencing.
TaqMan. Used when the objective is to design a primer pair and a probe for TaqMan
quantitative PCR.
• Calculate. Pushing this button will activate the algorithm for designing primers.
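The 3' end and 5' end G/C restrictions described above amount to counting G and C nucleotides in a terminal window and comparing the count with the Min and Max settings. The sketch below illustrates this. The function names are hypothetical, primers are assumed to be written 5' to 3', and both limits are assumed to be inclusive.

```python
def end_gc_count(primer, end_length, end="3'"):
    """Count G and C nucleotides in the terminal window of a primer.

    primer: sequence written 5' to 3' (assumed), so the 3' end is the last bases.
    end_length: number of consecutive terminal nucleotides to consider.
    """
    if end_length < 1:
        raise ValueError("end_length must be at least 1")
    primer = primer.upper()
    window = primer[-end_length:] if end == "3'" else primer[:end_length]
    return sum(base in "GC" for base in window)


def passes_end_restriction(primer, end_length, min_gc, max_gc, end="3'"):
    """True if the terminal G/C count lies within [min_gc, max_gc]."""
    return min_gc <= end_gc_count(primer, end_length, end) <= max_gc


primer = "ATGCATGCCG"  # hypothetical primer, 5' to 3'
print(end_gc_count(primer, 5, end="3'"), end_gc_count(primer, 5, end="5'"))
```

For this primer, the last five nucleotides (TGCCG) contain four G/C nucleotides, so a 3' end restriction allowing at most three would reject it.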
The number of information lines reflects the chosen length interval for primers and probes. One
line is shown for every possible primer length; if the length interval is widened, more lines will
appear. At each potential primer starting position, a circle is shown that indicates whether the
primer fulfills the requirements set in the primer parameters preference group. A green circle
indicates a primer that fulfills all criteria, and a red circle indicates a primer that fails to meet
one or more of the set criteria. For more detailed information, place the mouse cursor over the
circle representing the primer of interest. A tool-tip will then appear on screen displaying detailed
information about the primer in relation to the set criteria. To locate the primer on the sequence,
simply left-click the circle.
The various primer parameters can now be varied to explore their effect, and the view area will
dynamically update to reflect this. If, for example, the allowed melting temperature interval is
widened, more green circles will appear, indicating that more primers now fulfill the set
requirements. If a requirement for 3' G/C content is selected, red circles will appear at the
starting points of the primers that fail to meet this requirement.
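The layout of the compact information mode, with one line per primer length and one circle per starting position, can be sketched as a simple scan over the template. This is only an illustration: the function names and template are hypothetical, and a single criterion (overall G/C content within a range) stands in for the full set of primer parameters.

```python
def gc_fraction(sequence):
    """Fraction of G and C nucleotides in a sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def primer_status_lines(template, min_length, max_length, min_gc, max_gc):
    """Return {primer_length: [status per start position]}.

    A status is "green" when the candidate primer meets the criterion and
    "red" when it does not. Only G/C content is checked in this sketch.
    """
    template = template.upper()
    lines = {}
    for length in range(min_length, max_length + 1):
        statuses = []
        for start in range(len(template) - length + 1):
            candidate = template[start:start + length]
            meets = min_gc <= gc_fraction(candidate) <= max_gc
            statuses.append("green" if meets else "red")
        lines[length] = statuses
    return lines


lines = primer_status_lines("ATGCGC", min_length=4, max_length=5,
                            min_gc=0.55, max_gc=1.0)
for length, statuses in lines.items():
    print(length, statuses)
```

Widening the length interval adds lines to the result, and relaxing the G/C range turns more positions green, mirroring the behavior described above.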
The number of information-line-groups reflects the chosen length interval for primers and probes.
One group is shown for every possible primer length. Within each group, a line is shown for every
primer property that is selected from the checkboxes in the primer information preference group.
Primer properties are shown at each potential primer starting position and are of two types:
Properties with numerical values are represented by bar plots. A green bar represents the starting
point of a primer that meets the set requirement and a red bar represents the starting point of a
primer that fails to meet the set requirement:
• G/C content
• Melting temperature
Properties with Yes - No values. If a primer meets the set requirement, a green circle is
shown at its starting position, and if it fails to meet the requirement, a red circle is shown at its
starting position:
Common to both sorts of properties is that mouse clicking an information point (filled circle or
bar) will cause the region covered by the associated primer to be selected on the sequence.
Saving primers. Primer solutions in a table row can be saved by selecting the row and using the
right-click mouse menu. This opens a dialog that allows the user to save the primers to the
desired location. Primers and probes are saved as DNA sequences in the program. This means
that all available DNA analyses can be performed on the saved primers. Furthermore, the primers
can be edited using the standard sequence view to introduce e.g. mutations and restriction sites.
Saving PCR fragments. The PCR fragment generated from the primer pair in a given table row can
also be saved by selecting the row and using the right-click mouse menu. This opens a dialog
that allows the user to save the fragment to the desired location. The fragment is saved as a
DNA sequence and the position of the primers is added as annotation on the sequence. The
fragment can then be used for further analysis and included in e.g. an in-silico cloning experiment
using the cloning editor.
Adding primer binding annotation. You can add an annotation to the template sequence specifying
the binding site of the primer: Right-click the primer in the table and select Mark primer annotation
on sequence.
It is also possible to define a Region to amplify in which case a forward- and a reverse primer
region are automatically placed so as to ensure that the designated region will be included in the
PCR fragment. If areas are known where primers must not bind (e.g. repeat rich areas), one or
more No primers here regions can be defined.
If two regions are defined, it is required that at least a part of the Forward primer region is located
upstream of the Reverse primer region.
After exploring the available primers (see section 22.3) and setting the desired parameter values
in the Primer Parameters preference group, the Calculate button will activate the primer design
algorithm.
Figure 22.7: Calculation dialog for PCR primers when only a single primer region has been defined.
The top part of this dialog shows the parameter settings chosen in the Primer parameters
preference group which will be used by the design algorithm.
Mispriming: The lower part contains a menu where the user can choose to include mispriming as
an exclusion criterion in the design process. If this option is selected, the algorithm will search for
competing binding sites of the primer within the rest of the sequence, to see if the primer would
match multiple locations. If a competing site is found (according to the parameters set), the
primer will be excluded.
The adjustable parameters for the search are:
• Exact match. Choose only to consider exact matches of the primer, i.e. all positions must
base pair with the template for mispriming to occur.
• Minimum number of base pairs required for a match. The number of nucleotides of the primer
that must base pair with the sequence in order to cause mispriming.
• Number of consecutive base pairs required in 3' end. The number of consecutive 3' end base
pairs in the primer that must be present for mispriming to occur. This option is included
since 3' terminal base pairs are known to be essential for priming to occur.
Note! Including a search for potential mispriming sites will prolong the search time substantially
if long sequences are used as template and if the minimum number of base pairs required for
a match is low. If the region to be amplified is part of a very long molecule and mispriming is a
concern, consider extracting part of the sequence prior to designing primers.
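How the three mispriming search parameters interact can be sketched as a scan over the template. This is a simplified illustration and not the algorithm used by the program. The function name `competing_sites` and the sequences are hypothetical, only one strand is scanned, and the primer is compared directly with the template strand it is identical to (so a matching position means the primer could base pair with the complementary strand there).

```python
def competing_sites(primer, template, intended_start,
                    min_base_pairs, min_3prime_run, exact_match=False):
    """Return start positions (0-based) of potential mispriming sites.

    min_base_pairs: minimum number of pairing positions for a match.
    min_3prime_run: number of consecutive 3' end positions that must pair.
    exact_match: if True, every primer position must pair.
    """
    primer = primer.upper()
    template = template.upper()
    length = len(primer)
    required = length if exact_match else min_base_pairs
    sites = []
    for start in range(len(template) - length + 1):
        if start == intended_start:
            continue  # the intended binding site is not a competing site
        window = template[start:start + length]
        pairs = [p == t for p, t in zip(primer, window)]
        if sum(pairs) < required:
            continue
        # The primer is written 5' to 3', so its 3' end is the last positions.
        if min_3prime_run > 0 and not all(pairs[-min_3prime_run:]):
            continue
        sites.append(start)
    return sites


primer = "ACGTACGT"
template = "ACGTACGT" + "TTTT" + "ACCTACGT" + "GGGG"  # hypothetical template
print(competing_sites(primer, template, intended_start=0,
                      min_base_pairs=7, min_3prime_run=3))
```

Lowering the minimum number of base pairs makes more windows qualify, which is also why such searches take longer on long templates.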
Figure 22.8: Calculation dialog for PCR primers when two primer regions have been defined.
Again, the top part of this dialog shows the parameter settings chosen in the Primer parameters
preference group which will be used by the design algorithm. The lower part again contains a
menu where the user can choose to include mispriming of both primers as a criterion in the design
process (see section 22.5.1). The central part of the dialog contains parameters pertaining to
primer pairs. Here the following parameters can be set:
• Maximum percentage point difference in G/C content - if this is set at e.g. 5 points a pair
of primers with 45% and 49% G/C nucleotides, respectively, will be allowed, whereas a pair
of primers with 45% and 51% G/C nucleotides, respectively will not be included.
• Max hydrogen bonds between pairs - the maximum number of hydrogen bonds allowed
between the forward and the reverse primer in a primer pair.
• Max hydrogen bonds between pair ends - the maximum number of hydrogen bonds allowed
in the consecutive ends of the forward and the reverse primer in a primer pair.
• Maximum length of amplicon - determines the maximum length of the PCR fragment.
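Two of the primer pair parameters above can be expressed as simple checks, as in the sketch below. The function name `pair_allowed` is hypothetical, and both limits are assumed to be inclusive, which the text does not state explicitly.

```python
def pair_allowed(gc_percent_forward, gc_percent_reverse, amplicon_length,
                 max_gc_difference, max_amplicon_length):
    """Check a primer pair against two pair-level limits (assumed inclusive)."""
    # Difference in G/C content, measured in percentage points.
    gc_ok = abs(gc_percent_forward - gc_percent_reverse) <= max_gc_difference
    length_ok = amplicon_length <= max_amplicon_length
    return gc_ok and length_ok


# With a maximum difference of 5 percentage points, 45% vs 49% is allowed,
# whereas 45% vs 51% is not.
print(pair_allowed(45, 49, 400, max_gc_difference=5, max_amplicon_length=1000))
print(pair_allowed(45, 51, 400, max_gc_difference=5, max_amplicon_length=1000))
```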
• Score - measures how closely the properties of the primer (or primer pair) match the
optimal solution in terms of the chosen parameters and tolerances. The higher the score,
the better the solution. The scale is from 0 to 100.
• Self annealing - the maximum self annealing score of the primer in units of hydrogen bonds
• Self annealing alignment - a visualization of the highest-scoring self annealing
alignment
• Self end annealing - the maximum score of consecutive end base-pairings allowed between
the ends of two copies of the same molecule in units of hydrogen bonds
• Secondary structure score - the score of the optimal secondary DNA structure found for
the primer. Secondary structures are scored by adding the number of hydrogen bonds in
the structure, and 2 extra hydrogen bonds are added for each stacking base-pair in the
structure
• Secondary structure - a visualization of the optimal DNA structure found for the primer
If both a forward and a reverse region are selected a table of primer pairs is shown, where
the above columns (excluding the score) are represented twice, once for the forward primer
(designated by the letter F) and once for the reverse primer (designated by the letter R).
Before these, and following the score of the primer pair, are the following columns containing
primer pair information:
• Pair annealing - the number of hydrogen bonds found in the optimal alignment of the forward
and the reverse primer in a primer pair
• Pair annealing alignment - a visualization of the optimal alignment of the forward and the
reverse primer in a primer pair.
• Pair end annealing - the maximum score of consecutive end base-pairings found between
the ends of the two primers in the primer pair, in units of hydrogen bonds
• Fragment length - the length (number of nucleotides) of the PCR fragment generated by the
primer pair
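Annealing scores expressed in hydrogen bonds can be illustrated with the sketch below. The exact alignment method used by the program is not described here, so this is an assumption-laden example: it considers only ungapped, antiparallel alignments, counts 2 hydrogen bonds per A-T pair and 3 per G-C pair, and reports the best total. The function name `pair_annealing_score` is hypothetical.

```python
# Hydrogen bonds per Watson-Crick base pair.
HYDROGEN_BONDS = {("A", "T"): 2, ("T", "A"): 2, ("G", "C"): 3, ("C", "G"): 3}


def pair_annealing_score(first, second):
    """Best hydrogen-bond total over all ungapped antiparallel alignments.

    Both primers are given 5' to 3'. Passing the same primer twice gives a
    self annealing score under the same assumptions.
    """
    first = first.upper()
    # Reverse the second primer so that it runs 3' to 5' alongside the first.
    second_3to5 = second.upper()[::-1]
    best = 0
    for offset in range(-(len(second_3to5) - 1), len(first)):
        score = 0
        for i, base in enumerate(second_3to5):
            j = i + offset
            if 0 <= j < len(first):
                score += HYDROGEN_BONDS.get((first[j], base), 0)
        best = max(best, score)
    return best


print(pair_annealing_score("GGGG", "CCCC"))      # four G-C pairs
print(pair_annealing_score("GAATTC", "GAATTC"))  # palindromic primer against itself
```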
The top and bottom parts of this dialog are identical to the Standard PCR dialog for designing
primer pairs described above.
The central part of the dialog contains parameters pertaining to primer pairs and the comparison
between the outer and the inner pair. Here five options can be set:
• Maximum percentage point difference in G/C content (described above under Standard
PCR) - this criterion is applied to both primer pairs independently.
• Maximum pair annealing score - the maximum number of hydrogen bonds allowed between
the forward and the reverse primer in a primer pair. This criterion is applied to all possible
combinations of primers.
• Minimum difference in the melting temperature of primers in the inner and outer primer
pair - all comparisons between the melting temperature of primers from the two pairs must
be at least this different, otherwise the primer set is excluded. This option is applied
to ensure that the inner and outer PCR reactions can be initiated at different annealing
temperatures. Please note that to ensure flexibility there is no directionality indicated when
setting parameters for melting temperature differences between inner and outer primer
pair, i.e. it is not specified whether the inner pair should have a lower or higher Tm . Instead
this is determined by the allowed temperature intervals for inner and outer primers that are
set in the primer parameters preference group in the side panel. If a higher Tm of inner
primers is desired, choose a Tm interval for inner primers which has higher values than the
interval for outer primers.
• Two radio buttons allowing the user to choose between a fast and an accurate algorithm
for primer prediction.
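The direction-agnostic melting temperature comparison described above can be illustrated with a minimal sketch. It is not the Workbench's implementation; the function name and the example temperatures are hypothetical, and the check simply assumes that every outer/inner comparison uses the absolute difference.

```python
from itertools import product


def tm_difference_ok(outer_tms, inner_tms, min_diff):
    """Return True if every outer/inner Tm comparison differs by at least min_diff.

    The absolute difference is used, so no direction (inner higher or lower)
    is implied; the direction would follow from the allowed Tm intervals.
    """
    return all(abs(o - i) >= min_diff for o, i in product(outer_tms, inner_tms))


# Hypothetical melting temperatures in degrees Celsius.
outer = [58.0, 59.5]  # forward outer, reverse outer
inner = [66.0, 67.2]  # forward inner, reverse inner

print(tm_difference_ok(outer, inner, min_diff=5.0))
print(tm_difference_ok(outer, inner, min_diff=8.0))
```

With these example values the smallest difference is 6.5 degrees, so the set passes a minimum of 5 but is excluded at a minimum of 8.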
Nested PCR output table
In nested PCR there are four primers in a solution: forward outer primer (FO), forward inner
primer (FI), reverse inner primer (RI) and reverse outer primer (RO).
The output table can show primer-pair combination parameters for all four combinations of
primers and single primer parameters for all four primers in a solution (see section on Standard
PCR for an explanation of the available primer-pair and single primer information).
The fragment length in this mode refers to the length of the PCR fragment generated by the inner
primer pair, and this is also the PCR fragment which can be exported.
22.7 TaqMan
CLC Main Workbench allows the user to design primers and probes for TaqMan PCR applications.
TaqMan probes are oligonucleotides that contain a fluorescent reporter dye at the 5' end and a
quenching dye at the 3' end. Fluorescent molecules become excited when they are irradiated and
usually emit light. However, in a TaqMan probe the energy from the fluorescent dye is transferred
to the quencher dye by fluorescence resonance energy transfer as long as the quencher and the
dye are located in close proximity i.e. when the probe is intact. TaqMan probes are designed
to anneal within a PCR product amplified by a standard PCR primer pair. If a TaqMan probe is
bound to a product template, the replication of this will cause the Taq polymerase to encounter
the probe. Upon doing so, the 5' exonuclease activity of the polymerase will cleave the probe.
This cleavage separates the quencher and the dye, and as a result the reporter dye starts to
emit fluorescence.
The TaqMan technology is used in Real-Time quantitative PCR. Since the accumulation of
fluorescence mirrors the accumulation of PCR products, it can be monitored in real time and
used to quantify the amount of template initially present in the buffer.
The technology is also used to detect genetic variation such as SNPs. By designing a TaqMan
probe which will specifically bind to one of two or more genetic variants it is possible to detect
genetic variants by the presence or absence of fluorescence in the reaction.
A specific requirement of TaqMan probes is that a G nucleotide cannot be present at the 5' end
since this will quench the fluorescence of the reporter dye. It is recommended that the melting
temperature of the TaqMan probe is about 10 degrees Celsius higher than that of the primer pair.
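As an illustration of the two probe guidelines above, the following sketch flags a probe that starts with G or whose melting temperature is not roughly 10 degrees above the primers. The function name, the tolerance of 2 degrees and the example values are assumptions made for this example only; the Workbench does not expose such a function.

```python
def probe_warnings(probe_seq, probe_tm, primer_tms, target_gap=10.0, tolerance=2.0):
    """Collect warnings for a TaqMan probe given in 5' to 3' orientation.

    target_gap reflects the recommended Tm difference of about 10 degrees;
    tolerance is a hypothetical margin for what counts as "about".
    """
    warnings = []
    if probe_seq.upper().startswith("G"):
        warnings.append("probe has a G at the 5' end")
    for tm in primer_tms:
        gap = probe_tm - tm
        if abs(gap - target_gap) > tolerance:
            warnings.append(
                f"probe Tm is {gap:.1f} degrees above a primer Tm "
                f"(recommended: about {target_gap:.0f})"
            )
    return warnings


# Hypothetical probe and primer melting temperatures (degrees Celsius).
print(probe_warnings("ACCTGGTCAAGCTTCGA", 70.0, [60.0, 61.0]))
print(probe_warnings("GACCTGGTCAAGCTTCG", 64.0, [60.0, 61.0]))
```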
Primer design for TaqMan technology involves designing a primer pair and a TaqMan probe.
In TaqMan the user must thus define three regions: a Forward primer region, a Reverse primer
region, and a TaqMan probe region. The easiest way to do this is to designate a TaqMan
primer/probe region spanning the sequence region where TaqMan amplification is desired. This
will automatically add all three regions to the sequence. If more control is desired about the
placing of primers and probes the Forward primer region, Reverse primer region and TaqMan
probe region can all be defined manually. If areas are known where primers or probes must not
bind (e.g. repeat rich areas), one or more No primers here regions can be defined. The regions
are defined by making a selection on the sequence and right-clicking the selection.
It is required that at least a part of the Forward primer region is located upstream of the TaqMan
Probe region, and that the TaqMan Probe region is located upstream of a part of the Reverse
primer region.
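One possible reading of this ordering requirement is sketched below. The interval representation and the function name are hypothetical, and the interpretation (the forward region must start before the probe region starts, and the probe region must end before the reverse region ends) is an assumption, not a statement of how the Workbench validates regions.

```python
def regions_ordered(forward, probe, reverse):
    """Check the region ordering for TaqMan design.

    Each region is a (start, end) tuple with start <= end, using positions on
    the same strand. Overlapping regions are allowed as long as part of the
    forward region lies upstream of the probe region and the probe region lies
    upstream of part of the reverse region.
    """
    forward_start, _ = forward
    probe_start, probe_end = probe
    _, reverse_end = reverse
    return forward_start < probe_start and probe_end < reverse_end


print(regions_ordered((10, 40), (60, 90), (120, 150)))
print(regions_ordered((70, 100), (60, 90), (120, 150)))
```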
In TaqMan mode the Inner melting temperature menu in the primer parameters panel is activated
allowing the user to set a separate melting temperature interval for the TaqMan probe.
After exploring the available primers (see section 22.3) and setting the desired parameter values
in the Primer Parameters preference group, the Calculate button will activate the primer design
algorithm.
After pressing the Calculate button a dialog will appear (see figure 22.10) which is similar to the
Nested PCR dialog described above (see section 22.6).
In this dialog the options to set a minimum and a desired melting temperature difference between
outer and inner refer to primer pair and probe respectively.
Furthermore, the central part of the dialog contains an additional parameter:
• Maximum length of amplicon - determines the maximum length of the PCR fragment
generated in the TaqMan analysis.
TaqMan output table
In TaqMan mode there are two primers and a probe in a given solution:
forward primer (F), reverse primer (R) and a TaqMan probe (TP).
The output table can show primer/probe-pair combination parameters for all three combinations
of primers and single primer parameters for both primers and the TaqMan probe (see section on
Standard PCR for an explanation of the available primer-pair and single primer information).
The fragment length in this mode refers to the length of the PCR fragment generated by the
primer pair, and this is also the PCR fragment which can be exported.
For each solution, the single primer information described under Standard PCR is available in the
table.
Figure 22.12: The initial view of an alignment used for primer design.
The workflow when designing alignment based primers and probes is as follows (see figure 22.13):
Figure 22.13: The initial view of an alignment used for primer design.
• Use selection boxes to specify groups of included and excluded sequences. To select all
the sequences in the alignment, right-click one of the selection boxes and choose Mark
All.
• Mark either a single forward primer region, a single reverse primer region or both on the
sequence (and perhaps also a TaqMan region). Selections must cover all sequences in
the included group. You can also specify that there should be no primers in a region (No
Primers Here) or that a whole region should be amplified (Region to Amplify).
• Perfect match. Specifies that the designed primers must have a perfect match to all
relevant sequences in the alignment. When selected, primers will thus only be located
in regions that are completely conserved within the sequences belonging to the included
group.
• Allow degeneracy. Designs primers that may include ambiguity characters where
heterogeneities occur in the included template sequences. The allowed fold of degeneracy is
user defined and corresponds to the number of possible primer combinations formed by
a degenerate primer. Thus, if a primer covers two 4-fold degenerate sites and one 2-fold
degenerate site, the total fold of degeneracy is 4 × 4 × 2 = 32 and the primer will, when
supplied from the manufacturer, consist of a mixture of 32 different oligonucleotides. When
scoring the available primers, degenerate primers are given a score which decreases with
the fold of degeneracy.
• Allow mismatches. Designs primers which are allowed a specified number of mismatches
to the included template sequences. The melting temperature algorithm employed includes
the latest thermodynamic parameters for calculating Tm when single-base mismatches
occur.
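The fold of degeneracy described under Allow degeneracy is the product of the number of bases each position can represent. The sketch below computes it from the standard IUPAC nucleotide ambiguity codes; the function name and the example primer are hypothetical.

```python
from math import prod

# Number of bases represented by each IUPAC nucleotide code.
IUPAC_FOLD = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def fold_of_degeneracy(primer):
    """Return the number of distinct oligonucleotides a degenerate primer represents."""
    return prod(IUPAC_FOLD[base] for base in primer.upper())


# Two 4-fold sites (N) and one 2-fold site (R): 4 * 4 * 2 = 32.
print(fold_of_degeneracy("ACNGTNCRA"))
```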
When in Standard PCR mode, clicking the Calculate button will prompt the dialog shown in
figure 22.14.
The top part of this dialog shows the single-primer parameter settings chosen in the Primer
parameters preference group which will be used by the design algorithm.
The central part of the dialog contains parameters pertaining to primer specificity (this is omitted
if all sequences belong to the included group). Here, three parameters can be set:
• Minimum number of mismatches - the minimum number of mismatches that a primer must
have against all sequences in the excluded group to ensure that it does not prime these.
• Minimum number of mismatches in 3' end - the minimum number of mismatches that a
primer must have in its 3' end against all sequences in the excluded group to ensure that
it does not prime these.
• Length of 3' end - the number of consecutive nucleotides to consider for mismatches in the
3' end of the primer.
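The three specificity parameters above can be combined as in the following sketch. It assumes, for simplicity, that each excluded sequence is represented by the stretch aligned to the primer, written in the same orientation as the primer, so that positions can be compared directly. The function names are hypothetical.

```python
def count_mismatches(primer, site):
    """Count positions where the primer differs from an equally long aligned site."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(p != s for p, s in zip(primer.upper(), site.upper()))


def is_specific(primer, excluded_sites, min_mismatches, min_mismatches_3p, len_3p):
    """Return True if the primer is sufficiently mismatched to every excluded site."""
    split = len(primer) - len_3p  # start of the 3' end window
    for site in excluded_sites:
        total = count_mismatches(primer, site)
        three_prime = count_mismatches(primer[split:], site[split:])
        if total < min_mismatches or three_prime < min_mismatches_3p:
            return False
    return True


primer = "ACGTACGTAC"
excluded = ["ACGAACGTTG"]  # 3 mismatches in total, 2 of them in the last 3 bases
print(is_specific(primer, excluded, min_mismatches=3, min_mismatches_3p=2, len_3p=3))
print(is_specific(primer, excluded, min_mismatches=3, min_mismatches_3p=3, len_3p=3))
```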
The lower part of the dialog contains parameters pertaining to primer pairs (this is omitted when
only designing a single primer). Here, three parameters can be set:
• Maximum percentage point difference in G/C content - if this is set at e.g. 5 points a pair
of primers with 45% and 49% G/C nucleotides, respectively, will be allowed, whereas a pair
of primers with 45% and 51% G/C nucleotides, respectively will not be included.
• Max hydrogen bonds between pairs - the maximum number of hydrogen bonds allowed
between the forward and the reverse primer in a primer pair.
• Maximum length of amplicon - determines the maximum length of the PCR fragment.
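The G/C criterion in the list above compares the G/C percentages of the two primers. A minimal sketch follows; the function names are hypothetical, and treating a difference exactly equal to the limit as allowed is an assumption, since the text only gives examples on either side of the limit.

```python
def gc_percent(seq):
    """Return the percentage of G and C nucleotides in a sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc_difference_ok(gc_forward, gc_reverse, max_points):
    """Return True if two G/C percentages differ by at most max_points."""
    return abs(gc_forward - gc_reverse) <= max_points


# The example from the text: with a limit of 5 points, 45% and 49% pass, 45% and 51% do not.
print(gc_difference_ok(45, 49, 5))
print(gc_difference_ok(45, 51, 5))
```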
The output of the design process is a table of single primers or primer pairs as described for
primer design based on single sequences. These primers are specific to the included sequences
in the alignment according to the criteria defined for specificity. The only novelty in the table is
that melting temperatures are displayed with a maximum, a minimum and an average value
to reflect that degenerate primers or primers with mismatches may have heterogeneous behavior
on the different templates in the group of included sequences.
Figure 22.14: Calculation dialog shown when designing alignment based PCR primers.
sequences but not match the included sequences. As above, the selection boxes are used to
indicate the status of a sequence: if the box is checked, the sequence belongs to the included
sequences; if not, it belongs to the excluded sequences. We use the terms included and excluded
here to be consistent with the section above, although a probe solution is presented for both
groups. In TaqMan mode, primers are not allowed degeneracy or mismatches to any template
sequence in the alignment; variation is only allowed/required in the TaqMan probes.
Pushing the Calculate button will cause the dialog shown in figure 22.15 to appear.
The top part of this dialog is identical to the Standard PCR dialog for designing primer pairs
described above.
The central part of the dialog contains parameters to define the specificity of TaqMan probes.
Two parameters can be set:
• Minimum number of mismatches - the minimum total number of mismatches that must
exist between a specific TaqMan probe and all sequences which belong to the group not
recognized by the probe.
The lower part of the dialog contains parameters pertaining to primer pairs and the comparison
between the outer oligos (primers) and the inner oligos (TaqMan probes). Here, five options can
be set:
• Maximum percentage point difference in G/C content (described above under Standard
PCR).
• Maximum pair annealing score - the maximum number of hydrogen bonds allowed between
the forward and the reverse primer in an oligo pair. This criterion is applied to all possible
combinations of primers and probes.
• Minimum difference in the melting temperature of primer (outer) and TaqMan probe (inner)
oligos - all comparisons between the melting temperature of primers and probes must be
at least this different, otherwise the solution set is excluded.
• Desired temperature difference in melting temperature between outer (primers) and inner
(TaqMan) oligos - the scoring function discounts solution sets which deviate greatly from
this value. Regarding this, and the minimum difference option mentioned above, please
note that to ensure flexibility there is no directionality indicated when setting parameters
for melting temperature differences between probes and primers, i.e. it is not specified
whether the probes should have a lower or higher Tm . Instead this is determined by
the allowed temperature intervals for inner and outer oligos that are set in the primer
parameters preference group in the side panel. If a higher Tm of probes is required, choose
a Tm interval for probes which has higher values than the interval for outer primers.
The output of the design process is a table of solution sets. Each solution set contains the
following: a set of primers which are general to all sequences in the alignment, a TaqMan
probe which is specific to the set of included sequences (sequences where selection boxes are
checked) and a TaqMan probe which is specific to the set of excluded sequences (marked by
*). Otherwise, the table is similar to that described above for TaqMan probe prediction on single
sequences.
Figure 22.15: Calculation dialog shown when designing alignment based TaqMan probes.
In the Concentrations panel a number of parameters can be specified concerning the reaction
mixture. These parameters influence the melting temperatures.
In the Template panel the sequences of the chosen primer and the template sequence are shown.
The template sequence is by default set to the reverse complement of the primer sequence, i.e.
as perfectly base-pairing. However, it is possible to edit the template to introduce mismatches
which may affect the melting temperature. At each side of the template sequence a text field is
shown. Here, the dangling ends of the template sequence can be specified. These may have an
important effect on the melting temperature [Bommarito et al., 2000].
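For reference, the default template described above is the reverse complement of the primer. The sketch below shows how such a sequence can be derived for plain A, C, G and T; the function name is hypothetical.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence made of A, C, G and T."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCC"))
```

Changing one base of the result before calculating would correspond to introducing a mismatch in the template.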
Click Finish to start the tool. The result is shown in figure 22.17:
In the Side Panel you can specify the information to display about the primer. The information
parameters of the primer properties table are explained in section 22.5.3.
At the top, select one or more primers by clicking the browse ( ) button. In CLC Main Workbench,
primers are just DNA sequences like any other, but there is a filter on the length of the sequence.
Only sequences up to 400 bp can be added.
The Match criteria for matching a primer to a sequence are:
• Exact match. Choose only to consider exact matches of the primer, i.e. all positions must
base pair with the template.
• Minimum number of base pairs required for a match. How many nucleotides of the primer
must base pair to the sequence in order to cause priming/mispriming.
• Number of consecutive base pairs required in 3' end. How many consecutive 3' end base
pairs in the primer must be present for priming/mispriming to occur. This option is
included since 3' terminal base pairs are known to be essential for priming to occur.
Note that the number of mismatches is reported in the output, so you will be able to filter on this
afterwards (see below).
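The match criteria above can be sketched as follows. This is a simplified illustration, not the tool's algorithm: it assumes the candidate site has the same length as the primer and is written in the primer's orientation, so that a matching position stands for a base pair. All names are hypothetical.

```python
def paired_positions(primer, site):
    """Count positions where the primer would base pair with the candidate site."""
    return sum(p == s for p, s in zip(primer.upper(), site.upper()))


def consecutive_3prime_pairs(primer, site):
    """Count uninterrupted base pairs starting from the 3' end of the primer."""
    count = 0
    for p, s in zip(reversed(primer.upper()), reversed(site.upper())):
        if p != s:
            break
        count += 1
    return count


def is_binding_site(primer, site, min_base_pairs, min_3prime_pairs):
    """Apply both thresholds to an equally long candidate site."""
    return (
        len(primer) == len(site)
        and paired_positions(primer, site) >= min_base_pairs
        and consecutive_3prime_pairs(primer, site) >= min_3prime_pairs
    )


primer = "ACGTACGT"
site = "ACGAACGT"  # one mismatch at the fourth position
print(paired_positions(primer, site), consecutive_3prime_pairs(primer, site))
print(is_binding_site(primer, site, min_base_pairs=7, min_3prime_pairs=4))
```

Setting the minimum number of base pairs equal to the primer length corresponds to the Exact match option.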
Below the match settings, you can adjust Concentrations concerning the reaction mixture. This
is used when reporting melting temperatures for the primers.
Figure 22.19: Output options include reporting of binding sites and fragments.
• Add binding site annotations. This will add annotations to the input sequences (see details
below).
• Create binding site table. Creates a table of all binding sites (described in detail below).
• Create fragment table. Creates a table of all fragments that could result from using the
primers. Note that you can set the minimum and maximum sizes of the fragments to be
shown. The table is described in detail below.
• Sequence of the primer. Positions with mismatches will be in lower-case (see the fourth
position in figure 22.20 where the primer has an a and the template sequence has a T).
• Number of mismatches.
• Number of other hits on the same sequence. This number can be useful to check specificity
of the primer.
• Binding region. This region ends with the 3' exact match and is simply the primer length
upstream. This means that if you have 5' extensions to the primer, part of the binding
region covers sequence that will actually not be annealed to the primer.
The information here is the same as in the primer annotation and furthermore you can see
additional information about melting temperature etc. by selecting the options in the Side Panel.
See a more detailed description of this information in section 22.5.3. You can use this table
to browse the binding sites. If you make a split view of the table and the sequence (see
section 2.1.4), you can browse through the binding positions by clicking in the table. This will
cause the sequence view to jump to the position of the binding site.
An example of a fragment table is shown in figure 22.22.
Figure 22.22: A table showing all possible fragments of the specified size.
The table first lists the names of the forward and reverse primers, then the length of the fragment
and the region. The last column tells if there are other possible fragments fulfilling the length
criteria on this sequence. This information can be used to check for competing products in the
PCR. In the Side Panel you can show information about melting temperature for the primers as
well as the difference between melting temperatures.
You can use this table to browse the fragment regions. If you make a split view of the table and
the sequence (see section 2.1.4), you can browse through the fragment regions by clicking in the
table. This will cause the sequence view to jump to the start position of the fragment.
There are some additional options in the fragment table. First, you can annotate the fragment on
the original sequence. This is done by right-clicking (Ctrl-click on Mac) the fragment and choosing
Annotate Fragment as shown in figure 22.23.
Figure 22.23: Right-clicking a fragment allows you to annotate the region on the input sequence or
open the fragment as a new sequence.
This will put a PCR fragment annotation on the input sequence covering the region specified in
the table. As you can see from figure 22.23, you can also choose to Open Fragment. This will
create a new sequence representing the PCR product that would be the result of using these two
primers. Note that if you have extensions on the primers, they will be used to construct the new
sequence.
If you are doing restriction cloning using primers with restriction site extensions, you can use this
functionality to retrieve the PCR fragment for use in the cloning editor (see section 23.3).
Contents
23.1 Restriction site analyses
23.1.1 Dynamic restriction sites
23.1.2 Restriction Site Analysis
23.1.3 Insert restriction site
23.2 Restriction enzyme lists
23.3 Restriction Based Cloning
23.3.1 Introduction to the Cloning Editor
23.3.2 The restriction cloning workflow
23.3.3 Manual cloning
23.4 Homology Based Cloning
23.4.1 Working with homology based cloning
23.4.2 Adjust the homology based cloning design
23.4.3 Homology Based Cloning outputs
23.4.4 Detailed description of the Homology Based Cloning wizard
23.4.5 Working with mutations
23.5 Gateway cloning
23.5.1 Add attB sites
23.5.2 Create entry clones (BP)
23.5.3 Create expression clones (LR)
23.6 Gel electrophoresis
23.6.1 Gel view
CLC Main Workbench offers graphically advanced in silico cloning and design of vectors, together
with restriction enzyme analysis and functionalities for managing lists of restriction enzymes.
CHAPTER 23. CLONING AND RESTRICTION SITES 502
• In many cases, the dynamic restriction sites found in the Side Panel of sequence views are
the fastest and easiest way of showing restriction sites.
• Under the Tools menu you will find the Restriction Sites Analysis tool, which provides more
control over the analysis and more output options, such as a table of restriction sites. It
also allows you to perform the same restriction map analysis on several sequences in one
step.
The color of the restriction enzyme can be changed by clicking the colored box next to the
enzyme's name. The name of the enzyme can also be shown next to the restriction site by
selecting Show above the list of restriction enzymes.
There is also an option to specify how the Labels should be shown:
• No labels. This will just display the cut site with no information about the name of the
enzyme. Placing the mouse cursor on the cut site will reveal this information as a tool tip.
• Flag. This will place a flag just above the sequence with the enzyme name (see an example
in figure 23.2). Note that this option will make it hard to see when several cut sites are
located close to each other. In the circular view, this option is replaced by the Radial option.
• Radial. This option is only available in the circular view. It will place the restriction site
labels as close to the cut site as possible (see an example in figure 23.3).
• Stacked. This is similar to the flag option for linear sequence views, but it will stack the
labels so that all enzymes are shown. For circular views, it will align all the labels on each
side of the circle. This can be useful for clearly seeing the order of the cut sites when they
are located closely together (see an example in figure 23.4).
Note that in a circular view, the Stacked and Radial options also affect the layout of annotations.
Just above the list of enzymes, three buttons can be used for sorting the list (see figure 23.5).
• Sort enzymes alphabetically ( ). Clicking this button will sort the list of enzymes
alphabetically.
• Sort enzymes by number of restriction sites ( ). This will divide the enzymes into four
groups:
Non-cutters.
Single cutters.
Double cutters.
Multiple cutters.
There is a checkbox for each group which can be used to hide / show all the enzymes in a
group.
• Sort enzymes by overhang ( ). This will divide the enzymes into three groups:
There is a checkbox for each group which can be used to hide / show all the enzymes in a
group.
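The grouping by number of restriction sites described above can be illustrated with a short sketch. It assumes a linear sequence and palindromic recognition sequences without ambiguity codes, so that counting occurrences on one strand is sufficient. The function names and the example sequence are hypothetical; the recognition sequences are those of EcoRI (GAATTC), HindIII (AAGCTT) and BamHI (GGATCC).

```python
def count_sites(sequence, recognition):
    """Count occurrences (including overlapping ones) of a recognition sequence."""
    sequence = sequence.upper()
    recognition = recognition.upper()
    count = 0
    start = sequence.find(recognition)
    while start != -1:
        count += 1
        start = sequence.find(recognition, start + 1)
    return count


def cutter_group(n_sites):
    """Map a number of restriction sites to one of the four groups."""
    if n_sites == 0:
        return "Non-cutters"
    if n_sites == 1:
        return "Single cutters"
    if n_sites == 2:
        return "Double cutters"
    return "Multiple cutters"


enzymes = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}
sequence = "GAATTCTTTTAAGCTTTTTTAAGCTTCCCC"
for name, site in enzymes.items():
    print(name, cutter_group(count_sites(sequence, site)))
```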
Manage enzymes
The list of restriction enzymes contains per default some of the most popular enzymes, but you
can easily modify this list and add more enzymes by clicking the Manage enzymes button found
at the bottom of the "Restriction sites" palette of the Side Panel.
This will open the dialog shown in figure 23.6.
At the top, you can choose to Use existing enzyme list. Clicking this option lets you select an
enzyme list which is stored in the Navigation Area. A list of popular enzymes is available in the
Example Data folder you can download from the Help menu.
Below there are two panels:
• To the left, you can see all the enzymes that are in the list selected above. If you have not
chosen to use a specific enzyme list, this panel shows all the enzymes available.
• To the right, you can see the list of the enzymes that will be used.
Select enzymes in the left side panel and add them to the right panel by double-clicking or clicking
the Add button ( ).
The enzymes can be sorted by clicking the column headings, i.e., Name, Overhang, Methylation
or Popularity. This is particularly useful if you wish to use enzymes which produce a 3' overhang
for example.
When looking for a specific enzyme, it is easier to use the Filter. You can type HindIII or blunt
into the filter, and the list of enzymes will shrink automatically to include only the HindIII
enzyme or only the enzymes producing a blunt cut, respectively.
If you need more detailed information and filtering of the enzymes, you can hover your mouse on
an enzyme (see figure 23.7). You can also open a view of an enzyme list saved in the Navigation
Area.
Figure 23.7: Showing additional information about an enzyme like recognition sequence or a list of
commercial vendors.
At the bottom of the dialog, you can select to save the updated list of enzymes as a new file.
When you click on Finish, the enzymes are added to the Side Panel and the cut sites are shown
on the sequence. You can save the settings in the Side Panel, including the enzymes just added,
as described in section 4.6.
Figure 23.8: Deciding number of cut sites inside and outside the selection.
• Inside selection. Specify how many times you wish the enzyme to cut inside the selection.
• Outside selection. Specify how many times you wish the enzyme to cut outside the
selection (i.e. the rest of the sequence).
These panels offer a lot of flexibility for combining number of cut sites inside and outside
the selection, respectively. To give a hint of how many enzymes will be added based on the
combination of cut sites, the preview panel at the bottom lists the enzymes which will be added
when you click Finish. Note that this list is dynamically updated when you change the number of
cut sites. The enzymes shown in brackets [] are enzymes which are already present in the Side
Panel.
If you have selected more than one region on the sequence (using Ctrl or ), they will be treated
as individual regions. This means that the criteria for cut sites apply to each region.
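The selection-based criteria can be sketched as below. This is an illustration under stated assumptions, not the Workbench's logic: cut positions and regions are plain integers, region ends are inclusive, and for each region "outside" is taken to mean every cut not inside that region. All names are hypothetical.

```python
def cuts_inside_outside(cut_positions, region):
    """Return (inside, outside) cut counts for one (start, end) region, ends inclusive."""
    start, end = region
    inside = sum(start <= pos <= end for pos in cut_positions)
    return inside, len(cut_positions) - inside


def enzyme_passes(cut_positions, regions, allowed_inside, allowed_outside):
    """Return True if the cut counts satisfy the criteria for every selected region."""
    for region in regions:
        inside, outside = cuts_inside_outside(cut_positions, region)
        if inside not in allowed_inside or outside not in allowed_outside:
            return False
    return True


cuts = [50, 300, 900]  # hypothetical cut positions of one enzyme
print(cuts_inside_outside(cuts, (100, 400)))
print(enzyme_passes(cuts, [(100, 400)], allowed_inside={1}, allowed_outside={2}))
print(enzyme_passes(cuts, [(100, 400)], allowed_inside={1}, allowed_outside={0}))
```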
You first specify the sequences to analyze, and in the next step, which enzymes to use.
See section 23.1.1 for information about managing the restriction enzymes available.
In the next wizard step, you can limit the list of sites reported, depending on how many times
they cut a sequence (figure 23.10). The default is to report enzymes that cut the sequence one
or two times.
The Result handling wizard step (figure 23.11) lets you specify how the result of the restriction
Figure 23.11: Choosing to add restriction sites as annotations or creating a restriction map.
Add restriction sites as annotations to sequence(s) . This option makes it possible to see the
restriction sites on the sequence (see figure 23.12) and save the annotations for later use.
Create restriction map . When a restriction map is created, it can be shown in three different
ways:
• As a table of restriction sites as shown in figure 23.13. If more than one sequence was
selected, the table will include the restriction sites of all the sequences. This makes it
easy to compare the result of the restriction map analysis for two sequences.
Figure 23.13: The result of the restriction analysis shown as a table of restriction sites.
Each row in the table represents a restriction enzyme. The following information is available
for each enzyme:
Sequence. The name of the sequence which is relevant if you have performed
restriction map analysis on more than one sequence.
• As a table of fragments which shows the sequence fragments that would be the result of
cutting the sequence with the selected enzymes (see figure 23.14). Click the Fragments
button ( ) at the bottom of the view.
Figure 23.14: The result of the restriction analysis shown as a table of fragments.
Each row in the table represents a fragment. If more than one enzyme cuts in the same
region, or if an enzyme's recognition site is cut by another enzyme, there will be a fragment
for each of the possible cut combinations. Furthermore, if this is the case, you will see the
names of the other enzymes in the Conflicting Enzymes column.
The following information is available for each fragment.
Sequence. The name of the sequence which is relevant if you have performed
restriction map analysis on more than one sequence.
Length including overhang. The length of the fragment. If there are overhangs of the
fragment, these are included in the length (both 3' and 5' overhangs).
Region. The fragment's region on the original sequence.
Overhangs. If there is an overhang, this is displayed with an abbreviated version of the
fragment and its overhangs. The two rows of dots (.) represent the two strands of the
fragment and the overhang is visualized on each side of the dots with the residue(s)
that make up the overhang. If there are only the two rows of dots, it means that there
is no overhang.
Left end. The enzyme that cuts the fragment to the left (5' end).
Right end. The enzyme that cuts the fragment to the right (3' end).
Conflicting enzymes. If more than one enzyme cuts at the same position, or if an
enzyme's recognition site is cut by another enzyme, a fragment is displayed for each
possible combination of cuts. At the same time, this column will display the enzymes
that are in conflict. If there are conflicting enzymes, they will be colored red to alert
the user. If the same experiment were performed in the lab, conflicting enzymes
could lead to wrong results. For this reason, this functionality is useful to simulate
digestions with complex combinations of restriction enzymes.
If views of both the fragment table and the sequence are open, clicking in the fragment
table will select the corresponding region on the sequence.
• As a virtual gel simulation which shows the fragments as bands on a gel (see figure 23.48).
For more information about gel electrophoresis, see section 23.6.
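To make the idea behind the table of fragments concrete, the sketch below derives fragments from a set of cut positions on a linear sequence. It is a deliberate simplification: overhangs and conflicting enzymes, which the table reports, are ignored, and a cut position is taken to be the index where the sequence is split. The function name is hypothetical.

```python
def linear_fragments(seq_length, cut_positions):
    """Return fragments of a linear sequence as half-open (start, end) intervals.

    Cut positions outside the open interval (0, seq_length) and duplicates
    are ignored.
    """
    cuts = sorted({pos for pos in cut_positions if 0 < pos < seq_length})
    bounds = [0] + cuts + [seq_length]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


fragments = linear_fragments(1000, [700, 200])
print(fragments)
print([end - start for start, end in fragments])
```

The fragment lengths obtained this way are what a gel simulation would display as bands.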
At the top, you can select an existing enzyme list or you can use the full list of enzymes (default).
Select an enzyme, and you will see its recognition sequence in the text field below the list
(AAGCTT). If you wish to insert additional residues such as tags, this can be typed into the text
fields adjacent to the recognition sequence.
Clicking OK will insert the restriction site and the tag(s) before or after the selection. If the enzyme
selected was not already present in the list in the Side Panel, it will now be added and selected.
laboratory freezer, or all enzymes used to create a given restriction map or all enzymes that are
available from the preferred vendor.
In the Example data (import in your Navigation Area using the Help menu), under Nucleotide-
>Restriction analysis, there are two enzyme lists: one with the 50 most popular enzymes, and
another with all enzymes that are included in the CLC Main Workbench.
Create enzyme list
CLC Main Workbench uses enzymes from the REBASE restriction enzyme
database at http://rebase.neb.com. If you want to customize the enzyme database for
your installation, see section C.
To create an enzyme list of a subset of these enzymes:
File | New | Enzyme list ( )
This opens the dialog shown in figure 23.16.
Choose which enzymes you want to include in the new enzyme list (see section 23.1.1), and click
Finish to open the enzyme list.
View and modify enzyme list
An enzyme list is shown in figure 23.17. It can be sorted by
clicking the columns, and you can use the filter at the top right corner to search for specific
enzymes, recognition sequences etc.
If you wish to remove or add enzymes, click the Add/Remove Enzymes button at the bottom of
the view. This will present the same dialog as shown in figure 23.16 with the enzyme list shown
to the right.
If you wish to extract a subset of an enzyme list, open the list, select the relevant enzymes,
right-click on the selection and choose to Create New Enzyme List from Selection ( ).
If you combine this method with the filter located at the top of the view, you can extract a very
specific set of enzymes. For example, if you wish to create a list of enzymes sold by a particular
distributor, type the name of the distributor into the filter and select and create a new enzyme
list from the selection.
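The filter-then-select approach can be pictured as a simple text match over the rows of the enzyme table. The sketch below is only an analogy: the row layout, the column names and the supplier values ("Vendor A", "Vendor B") are made up for this example, while the enzyme names and recognition sequences are standard.

```python
# Hypothetical enzyme table; the supplier names are placeholders.
ENZYMES = [
    {"name": "EcoRI", "recognition": "GAATTC", "supplier": "Vendor A"},
    {"name": "HindIII", "recognition": "AAGCTT", "supplier": "Vendor B"},
    {"name": "BamHI", "recognition": "GGATCC", "supplier": "Vendor A"},
]

def filter_enzymes(enzymes, text):
    """Keep the rows where any column contains the filter text (case-insensitive)."""
    needle = text.lower()
    return [row for row in enzymes
            if any(needle in str(value).lower() for value in row.values())]

# The filtered rows play the role of the selection used to create a new list.
vendor_a_list = filter_enzymes(ENZYMES, "vendor a")
```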
Figure 23.18: Selecting the sequences containing the fragments you want to clone and the vector.
CLC Main Workbench will now create a sequence list of the selected fragments and vector
sequences. For cloning work, open the sequence list and switch to the Cloning Editor ( ) at the
bottom of the view (figure 23.19).
Figure 23.19: Cloning editor view of the sequence list. Choose which sequence to display from the
drop down menu.
If you later in the process need additional sequences, right-click anywhere on the empty white
area of the view and choose to "Add Sequences".
• At the top, there is a panel to switch between the sequences selected as input for the
cloning. You can also specify whether the sequence should be visualized as circular or as
a fragment. On the right-hand side, you can select a vector: the button is by default set to
Change to Current. Click on it to select the currently shown sequence as vector.
• In the middle, the selected sequence is shown. This is the central area for defining how
the cloning should be performed.
• At the bottom, there is a panel where the selection of fragments and target vector is
performed.
• Click on the Cloning Editor icon ( ) in the view area when a sequence list has been
opened in the sequence list editor.
• Create a new cloning experiment using the Restriction Based Cloning ( ) action from the
toolbox. This tool collects a set of existing sequences and creates a new sequence list.
• Cloning mode Opened when one of the sequences has been selected as 'Vector'. In
this mode, you can apply one or more cuts to the vector, thereby creating an opening
for insertion of other sequence fragments. From the remaining sequences in the cloning
experiment/sequence list, either complete sequences or fragments created by cutting can
be inserted into the vector. In the cloning adapter dialog, the order and direction of the
inserted fragments can be adjusted prior to adjusting the overhangs to match the cloning
conditions.
• Stitch mode If no sequence has been selected as 'Vector', a number of fragments (either
full sequences or cuttings) can be selected from the cloning experiment. These can then
be stitched together into a single new sequence. In the stitching adapter dialog, the order
and direction of the fragments can be adjusted prior to adjusting the overhangs to match
the stitch conditions.
Figure 23.21: EcoRI site used to open the vector. Note that the "Cloning" button has now been
enabled as both criteria ("Target vector selection defined" and "Fragments to insert:...") have been
defined.
If you want to cut off part of the vector, click two restriction sites while pressing the Ctrl key
(⌘ on Mac). You can also right-click the cut sites and use the Select This ... Site to select
a site. This will display two options for what the target vector should be (for linear vectors
there would have been three options). At any time, the selection of cut sites can be cleared
by clicking the Remove ( ) icon to the right of the target vector selections.
3. Perform cloning
Once both fragments and vector are selected, click Clone ( ). This will display a dialog to
adapt overhangs and change orientation as shown in figure 23.22.
This dialog visualizes the details of the insertion. The vector sequence is on each side
shown in a faded gray color. In the middle the fragment is displayed. If the overhangs of
the sequence and the vector do not match ( ), you will not be able to click Finish. But
you can blunt end or fill in the overhangs using the drag handles ( ) until the overhangs
match ( ).
The fragment can be reverse complemented by clicking the Reverse complement fragment
( ).
When several fragments are used, the order of the fragments can be changed by clicking
the move buttons ( )/ ( ).
By default, the construct will be opened in a new view and can be saved separately. However,
selecting the option Replace input sequences with result will add the construct to the
input sequence list and delete the original fragment and vector sequences.
Note that the cloning experiment used to design the construct can be saved as well. If you check
the History ( ) of the construct, you can see the details about restriction sites and fragments
used for the cloning.
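The requirement that overhangs match before you can click Finish can be expressed as a small check. The sketch below is a simplified model and not the Workbench's own logic: each end is described by a hypothetical pair of overhang type ("blunt", "5'" or "3'") and the single-stranded overhang read 5' to 3'. Two sticky ends are treated as compatible when they are of the same type and their overhangs are reverse complements of each other.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def ends_compatible(end_a, end_b):
    """Check whether two fragment ends can be joined.

    Each end is a (kind, overhang) pair, where kind is "blunt", "5'" or "3'"
    and overhang is the single-stranded part read 5' to 3'.
    """
    kind_a, overhang_a = end_a
    kind_b, overhang_b = end_b
    if kind_a != kind_b:
        return False
    if kind_a == "blunt":
        return True
    # Sticky ends anneal only if one overhang is the reverse complement
    # of the other.
    return overhang_a == reverse_complement(overhang_b)
```

An EcoRI end (5' overhang AATT) is compatible with another EcoRI end because AATT is its own reverse complement. Blunting or filling in both ends turns them into ("blunt", "") ends, which always match in this model.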
• Duplicate sequence. Adds a duplicate of the selected sequence to the sequence list
accessible from the drop down menu on top of the Cloning view.
• Insert sequence after this sequence ( ). The sequence to be inserted can be selected
from the sequence list via the drop down menu on top of the Cloning view. The inserted
sequence remains on the list of sequences. If the two sequences do not have blunt ends,
the ends' overhangs have to match each other.
• Insert sequence before this sequence ( ). The sequence to be inserted can be selected
from the sequence list via the drop down menu on top of the Cloning view. The inserted
sequence remains on the list of sequences. If the two sequences do not have blunt ends,
the ends' overhangs have to match each other.
• Reverse sequence. Reverses the sequence and replaces the original sequence in the list.
This is sometimes useful when working with single stranded sequences. Note that this is
not the same as creating the reverse complement of a sequence.
• Delete sequence ( ). Deletes the given sequence from the Cloning Editor.
• Make sequence linear ( ). Converts a sequence from a circular to a linear form, removing
the << and >> at the ends.
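The difference between reversing a sequence and reverse complementing it, mentioned for the Reverse sequence action above, is easy to see on a short example. This is a minimal sketch with hypothetical helper names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse(seq):
    """Read the same strand backwards; bases are not complemented."""
    return seq[::-1]

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse("ATGC"))             # CGTA
print(reverse_complement("ATGC"))  # GCAT
```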
• Duplicate Selection. If a selection on the sequence is duplicated, the selected region will
be added as a new sequence to the Cloning Editor. The name of the new sequence represents
the length of the fragment. When double-clicking on a sequence, the region between the
two closest restriction sites is automatically selected.
• Replace Selection with sequence. Replaces the selected region with a sequence selected
from the drop down menu listing all sequences in the Cloning Editor.
• Cut Sequence Before Selection ( ). Cleaves the sequence before the selection and will
result in two smaller fragments.
• Cut Sequence After Selection ( ). Cleaves the sequence after the selection and will
result in two smaller fragments.
• Make Positive Strand Single Stranded ( ). Makes the positive strand of the selected
region single stranded.
• Make Negative Strand Single Stranded ( ). Makes the negative strand of the selected
region single stranded.
• Make Double Stranded ( ). This will make the selected region double stranded.
• Move Starting Point to Selection Start. This is only active for circular sequences. It will
move the starting point of the sequence to the beginning of the selection.
• Copy ( ). Copies the selected region to the clipboard, which will enable it for use in other
programs.
• Open Selection in New View ( ). Opens the selected region in the normal sequence view.
• Edit Selection ( ). Opens a dialog box in which it is possible to edit the selected residues.
• Insert Restriction Sites After/Before Selection. Shows a dialog where you can choose
from a list of restriction enzymes (see section 23.1.3).
• Show Enzymes Cutting Inside/Outside Selection ( ). Adds enzymes cutting this selection
to the Side Panel.
• Add Structure Prediction Constraints. This is relevant for RNA secondary structure
prediction:
Force Stem Here is activated after choosing 2 regions of equal length on the sequence.
It will add an annotation labeled "Forced Stem" and will force the algorithm to compute
minimum free energy and structure with a stem in the selected region.
Prohibit Stem Here is activated after choosing 2 regions of equal length on the
sequence. It will add an annotation labeled "Prohibited Stem" to the sequence and
will force the algorithm to compute minimum free energy and structure without a stem
in the selected region.
Prohibit From Forming Base Pairs will add an annotation labeled "No base pairs"
to the sequence, and will force the algorithm to compute minimum free energy and
structure without a base pair containing any residues in the selected region.
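Of the actions listed above, Move Starting Point to Selection Start is the one that changes only the coordinates and not the molecule. For a circular sequence it amounts to a rotation, as in the sketch below. The function name and the 0-based index are assumptions made for this example.

```python
def move_starting_point(circular_seq, selection_start):
    """Rotate a circular sequence so that it starts at selection_start.

    selection_start is a 0-based index into the current linear representation.
    """
    if not 0 <= selection_start < len(circular_seq):
        raise ValueError("selection start is outside the sequence")
    return circular_seq[selection_start:] + circular_seq[:selection_start]
```

Rotating AAACCCGGG to start at index 3 gives CCCGGGAAA: the same circle, written from a different origin.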
The sequence that you have chosen to insert into will be marked with bold and the text [vector]
is appended to the sequence name. Note that this is completely unrelated to the vector concept
in the cloning workflow described in section 23.3.2.
Furthermore, the list includes the length of the fragment, an indication of the overhangs, and a
list of enzymes that are compatible with this overhang (for the left and right ends, respectively).
If not all the enzymes can be shown, place your mouse cursor on the enzymes, and a full list will
be shown in the tool tip.
Select the sequence you wish to insert and click Next to open the adapt insert sequence to vector dialog
(figure 23.26).
At the top is a button to reverse complement the inserted sequence.
Below is a visualization of the insertion details. The inserted sequence is at the middle shown in
red, and the vector has been split at the insertion point and the ends are shown at each side of
the inserted sequence.
If the overhangs of the sequence and the vector do not match ( ), you can blunt end or fill in
the overhangs using the drag handles ( ) until they do ( ).
At the bottom of the dialog is a summary field which records all the changes made to the
overhangs. The contents of the summary will also be written in the history ( ) of the cloning
experiment.
When you click Finish, the sequence is inserted and highlighted by being selected.
Figure 23.27: One sequence is now inserted into the cloning vector. The sequence inserted is
automatically selected.
Figure 23.28: Select the vector and fragments that should be assembled in the homology based
cloning reaction.
Press Next to open the wizard allowing you to inspect and adjust primers and overhangs.
General options
General options are at the top of the wizard. These include the position of the insertion site in the
vector, the maximum primer and overhang lengths as well as options to set the Tm and overhang
length for all primers at once. There is also a diagram of the vector including the inserts, where
each sequence has a different colour (figure 23.29).
Sequences
Figure 23.29: The top section of the wizard contains general options.
Each sequence is displayed individually, with a coloured bar to the left and a vertical scroll bar at
the bottom. The top sequence is the vector, with the insert sequences displayed further down.
The order of the sequences reflects how they will be assembled into the vector, and the overhangs
on the primers support this assembly order.
Vector, inserts, primers and overhangs are color coded (figure 23.30):
• Blue Added bases that are inserted between primer and overhang
For each sequence, you can adjust primer and overhang lengths and add bases between primers
and overhangs.
The vector sequence is considered circular and primers are depicted as pointing away from each
other in order to amplify the circular sequence. Inserts are considered linear, and primers are
placed at the ends of the insert sequence pointing towards each other in order to amplify the
linear sequence (figure 23.30).
• If one insert is assembled into a vector < 8 kb in length, overhangs are added to the vector
primers.
• If one insert is assembled into a vector > 8 kb in length, overhangs are added to the insert
primers.
• If more inserts are assembled into a vector, the overhangs are added to insert primers.
Primer and overhang lengths should be adjusted, according to the cloning kit used.
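The three rules above for where overhangs are placed can be summarized as a small decision function. This is a sketch of the rules as stated, with a hypothetical function name. The text does not say what happens for a vector of exactly 8 kb, so that case is left undecided here, and 8 kb is taken to mean 8000 bp.

```python
def overhang_placement(n_inserts, vector_length_bp):
    """Return which primers carry the overhangs, following the rules above.

    Returns "vector primers", "insert primers", or None when the text
    does not define the outcome (a single insert and exactly 8 kb).
    """
    if n_inserts < 1:
        raise ValueError("at least one insert is required")
    if n_inserts > 1:
        return "insert primers"
    if vector_length_bp < 8000:
        return "vector primers"
    if vector_length_bp > 8000:
        return "insert primers"
    return None  # exactly 8 kb: not specified in the text
```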
Figure 23.30: Top: The vector sequence and primers with overhangs. The grey sequence between
the primers is not included in the PCR product. Bottom: An insert sequence and primers with
overhangs.
Figure 23.31: Choose an insertion site from the drop down menu or type position(s) directly in the
Insertion site text field.
• To adjust all primers at once, change the Primer Tm in the top section and press Calculate
primers (figure 23.29). This will update primers on all sequences.
• To adjust all overhangs at once, change the Overhang length in the top section and press
Set Overhang Lengths (figure 23.29). This will update overhangs on all sequences.
• To adjust the length of individual primers and overhangs use the Primer length and
Overhang length options available for the forward and reverse primer on each sequence.
You can also extend or shorten the primer and overhang sequences by dragging the
arrow symbols at the ends of the primers and overhangs (figure 23.30).
Summary Contains the number of fragments and primers used in the cloning reaction
as well as their lengths and any warnings.
Fragments Lists the vector and fragments used in the cloning reaction.
Warnings Lists the warnings given for primer pairs.
Primer pairs Lists fragments for which primers were designed together with pair
annealing and pair end annealing values for the primer pairs. See section 22.5.3 for
information about annealing values.
Primers Lists individual primers and their sequence. Primer sequence is written with
capital letters, whereas added bases and overhangs are in lowercase.
Primer parts Lists full and subparts of designed primers with characteristics such as
length and G/C content. The following terms are used:
∗ Full The full primer including overhang and added bases.
∗ Anneal The part of the primer annealing to the original fragment (primer without
overhang or any added bases).
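To make the relation between the Full and Anneal parts concrete, the sketch below composes a full primer from its parts using the case convention described for the Primers list (annealing part in capitals, added bases and overhang in lowercase). The function names are hypothetical, and the 5' to 3' order overhang, added bases, annealing part is an assumption based on the description of added bases as sitting between primer and overhang.

```python
def full_primer(anneal, added_bases="", overhang=""):
    """Compose a full primer, written 5' to 3'.

    Assumed order: overhang, then added bases, then the annealing part.
    """
    return overhang.lower() + added_bases.lower() + anneal.upper()

def gc_content(seq):
    """Fraction of G and C bases in a sequence, ignoring case."""
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)
```

With this convention, `full_primer("atggcc", "ggatcc", "TTAA")` is written `ttaaggatccATGGCC`, so the template-specific part can be read off at a glance.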
• Assembled vector The vector as it will appear after all fragments have been assembled.
The assembled vector will be annotated with the positions of primers, added bases and
overhangs as well as with inserts and vector sequence. When the vector is opened, you
can select which annotations should be shown on the sequence in the side panel under
Annotation types.
Note: the assembled vector can be used as input to Homology Based Cloning if you wish
to adjust a previous design.
• Primers sequence list A sequence list containing the designed primers. The primers are
annotated with primer, added bases and overhang, where primer is the part of the sequence
that originally aligned to the insert or vector that was amplified.
• PCR fragments sequence list The PCR fragments generated from input sequences and
designed primers including additional bases and overhangs.
• Primer pairs table A table providing information about melting temperatures, secondary
structure, etc., for primer pairs with and without overhangs. For a description of each of the
columns in the Primer Pairs table, see section 22.5.3.
This section contains a detailed description of the Homology Based Cloning options (figure
23.32).
• Insertion site The position where fragments will be inserted in the vector. You can type in
a specific position or a range of positions. You can also choose the start, the end, or the
entire span of an annotation on the vector using the drop down menu (figure 23.33).
The primers designed to amplify the vector will be placed so that their 5' ends are adjacent
to the insertion site. If a range of positions is selected, the primers will be placed so
that the selected positions are not included in the PCR product. When the insertion site is
changed, the vector primers in the view below are updated accordingly.
Insertion site examples:
0 or 0 1 Assembles inserts into the vector between the last and the first base.
1 or 1 2 Assembles inserts into the vector between the first and second base.
1..10 Inserts replace bases 1-10 in the vector.
Start of an annotation Assembles inserts into the vector before the first base in the
annotated region.
Span of an annotation Inserts replace all bases in the annotated region.
End of an annotation Assembles inserts into the vector after the last base in the
annotated region.
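The effect of the different insertion site choices on the assembled sequence can be sketched as follows. This is only an illustration of the examples above, not the Workbench's parser: the function name is hypothetical, the site is passed as numbers rather than in the text field syntax, and the circular vector is written out linearly from its first base.

```python
def assemble(vector, insert, start, end=None):
    """Place an insert into a vector at an insertion site.

    With only start given, the insert goes after base number start
    (1-based; 0 means between the last and the first base).
    With start and end given, bases start..end (inclusive) are replaced.
    """
    if end is None:
        return vector[:start] + insert + vector[start:]
    return vector[:start - 1] + insert + vector[end:]

vector = "AACCGGTTAACC"
print(assemble(vector, "xxx", 0))      # xxxAACCGGTTAACC
print(assemble(vector, "xxx", 1))      # AxxxACCGGTTAACC
print(assemble(vector, "xxx", 1, 10))  # xxxCC
```

Choosing the start, span or end of an annotation corresponds to calling the same function with the annotation's first position minus one, its first and last positions, or its last position, respectively.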
Figure 23.32: General options for the cloning experiment are provided at the top of the wizard,
followed by sections for the vector and insert sequences, where many options relevant to the
cloning experiment can be adjusted.
• Maximum primer length The maximum length that primers for vectors and inserts can be.
This is reflected in the number of nucleotides visible for each sequence in the views below.
• Maximum overhang length The maximum length that overhangs for vectors and inserts can
be.
Figure 23.33: Specify the insertion site in the vector. Here the entire Lac-operon has been selected
from the drop down menu. Notice that when a span of bases is chosen as insertion site, the
vector sequence between the primers is grey and not included in the PCR product.
• Font size The font size to use for vector and insert sequences, and for primers and
overhangs.
• Primer Tm The primer melting temperature. This value does not take into account any
added bases or overhangs. Click on Calculate primers to update all primers after changing
this value.
• Overhang length The length of the overhang added to primers not including added bases.
Click on Set Overhang Lengths to update overhangs after changing this value.
• Open Primer Pairs Table Opens a table listing each of the primer pairs shown on the
sequences below. The primer pairs table contains primer pairs, both with and without
overhangs and added bases. It also provides information about melting temperatures,
secondary structure, etc. For a description of each of the columns in the Primer Pairs table,
see section 22.5.3.
• Vector map A vector map showing the assembled vector. Each original fragment has its
own color that matches the side bars of the sequences in the views below. If you hover
over the sequence of the vector or an insert, it will become bold in the vector map. If you
hover over a primer, it will appear on the vector map. The fragments, but not the primers
are drawn to scale.
• Sequence Name: n (vector, circular) and Sequence Name: n (insert y, linear). The
sequences identified as the vector and inserts.
• Arrows to the left of Sequence Name Change the order of the sequences in the list using
the up and down arrows. Reverse complement the sequence using the horizontal arrows.
• Primer length and Overhang length The primer and overhang lengths for the forward and
reverse primer, respectively. These lengths can be adjusted by typing new values into the
dialogs, or by using the up and down arrows to the right of the dialogs. Changes to the
lengths are immediately updated in the sequence view below. The Tm and primer pair
annealing alignment are also updated.
• Tm The primer melting temperature. This value does not include overhangs or any added
bases.
• Primer pair annealing alignment Predicted primer-primer annealing of the forward and
reverse primers. Overhangs and added bases are not included. The same plot is also
available in the primer pairs table.
• Added bases Insert additional bases between the primer and overhang. You can either type
the bases directly into the dialog, or you can choose the sequence of a specific restriction
enzyme from the drop down menu.
• Sequence and primer views For each sequence included in the homology cloning reaction,
you can see the part of the sequence that primers are designed to, as well as the primers
and their overhangs. For the vector, the fragment is considered circular and the primers
are placed pointing in opposite directions from the insertion site (figure 23.34). Inserts are
considered linear and primers are placed at the ends (figure 23.35).
Vector, inserts, primers and overhangs are color coded (figure 23.30):
The overhang of a primer for a given sequence is identical to the sequence that it will be
adjacent to in the assembled vector. Figures 23.34 and 23.35 show an example where
the linear sequence can be inserted into the circular sequence. Pink overhang bases on
the primers for the circular fragment are either the same sequence or complementary to
the black sequence of the linear DNA fragment. Overhangs are designed to assemble the
fragments in the order they appear in the wizard. In this example, two sequences are
assembled, but more than two can be used for homology based cloning.
• Warnings in sequence views A yellow or red exclamation mark next to the sequence name
warns of any problems (figure 23.36). Hover over the primer to get more information from the
tooltip or click on the warning to open a dialog showing the warning message. Examples of
when warnings appear include:
Figure 23.36: Hover over the yellow exclamation mark to see the warnings in the tooltip.
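The rule described for the sequence and primer views, that a primer overhang is identical to the sequence it will be adjacent to in the assembled vector, can be sketched for the case where the overhangs sit on the insert primers. This is a simplified model under stated assumptions, not the Workbench's algorithm: `fragments[0]` is the vector written linearly from the insertion site, the remaining entries are the inserts in assembly order, and all names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def insert_overhangs(fragments, overhang_len):
    """Return (forward, reverse) primer overhangs for each insert.

    fragments[0] is the linearized vector; the rest are inserts in the
    order they are assembled. The assembly is circular, so the last
    insert is followed by the start of the vector.
    """
    n = len(fragments)
    overhangs = []
    for i in range(1, n):
        previous = fragments[i - 1]
        following = fragments[(i + 1) % n]
        # Forward primer: same sequence as the end of the preceding fragment.
        forward = previous[-overhang_len:]
        # Reverse primer: complementary to the start of the following fragment.
        reverse = reverse_complement(following[:overhang_len])
        overhangs.append((forward, reverse))
    return overhangs
```

Changing the order of the fragments changes every overhang that borders a moved fragment, which is why the overhangs follow the order shown in the wizard.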
• Introduce mutations manually in the sequences before running Homology Based Cloning.
Place primers over mutated sites to ensure the mutations are included in the primer
sequence.
• Run Homology Based Cloning using the original sequences, and then introduce mutations
into the assembled vector. Re-run Homology Based Cloning using the vector containing the
mutations.
• Second, the attB-flanked fragment is recombined into a donor vector (the BP reaction) to
construct an entry clone.
• Finally, the target fragment from the entry clone is recombined into an expression vector
(the LR reaction) to construct an expression clone. For Multi-site gateway cloning, multiple
entry clones can be created that can recombine in the LR reaction.
During this process, both the attB-flanked fragment and the entry clone can be saved.
For more information about the Gateway technology, please visit
https://www.thermofisher.com/us/en/home/life-science/cloning/gateway-cloning/gateway-technology.html.
To perform these analyses in CLC Main Workbench, you need to import donor and expression
vectors. These can be found on the Thermo Fisher Scientific website: find the relevant vector
sequences, copy them, and paste them in the field that opens when you choose New | Sequence
in the workbench. Fill in additional information appropriately (enter a "Name", check the "Circular"
option) and save the sequences in the Navigation Area.
The default option is to use the attB1 and attB2 sites. If you have selected several fragments
and wish to add different combinations of sites, you will have to run this tool once for each
combination.
Next, you are given the option to extend the fragment with additional sequences by extending the
primers 5' of the template-specific part of the primer, i.e., between the template specific part
and the attB sites.
You can manually type or paste in a sequence of your choice, but it is also possible to click in
the text field and press Shift + F1 (Shift + Fn + F1 on Mac) to show some of the most common
additions (see figure 23.38). Use the up and down arrow keys to select a tag and press Enter.
To learn how to modify the default list of primer additions, see section 23.5.1.
At the bottom of the dialog, you can see a preview of what the final PCR product will look like. In
the middle there is the sequence of interest. In the beginning is the attB1 site, and at the end is
the attB2 site. The primer additions that you have inserted are shown in colors.
Figure 23.38: Primer additions 5' of the template-specific part of the primer where a Shine-Dalgarno
site has been added between the attB site and the gene of interest.
In the next step, specify the length of the template-specific part of the primers as shown in figure
23.39.
Figure 23.39: Specifying the length of the template-specific part of the primers.
The Workbench does not perform any kind of primer design when adding the attB sites. As a user, you
simply specify the length of the template-specific part of the primer, and together with the attB
sites and optional primer additions, this will be the primer. The primer region will be annotated
in the resulting attB-flanked sequence. You can also choose to get a list of primers in the Result
handling dialog (see figure 23.40).
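Because no primer design is involved, the primers are simply the concatenation of the attB site, any primer addition and the template-specific part. The sketch below illustrates this composition. It is not the Workbench's implementation: the function name is hypothetical, all sequences are assumed to be given on the sense strand, and the attB sequences are passed in as parameters rather than hard-coded (the values in the usage example are short placeholders, not real attB sites).

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def attb_primers(template, anneal_len, attb_fwd, attb_rev,
                 fwd_addition="", rev_addition=""):
    """Return (forward primer, reverse primer, attB-flanked product).

    All arguments are sense-strand sequences; the reverse primer is
    the reverse complement of the 3' end of the product.
    """
    if not 0 < anneal_len <= len(template):
        raise ValueError("anneal_len must be between 1 and the template length")
    forward = attb_fwd + fwd_addition + template[:anneal_len]
    reverse = reverse_complement(template[-anneal_len:] + rev_addition + attb_rev)
    product = attb_fwd + fwd_addition + template + rev_addition + attb_rev
    return forward, reverse, product

# Placeholder site sequences, for illustration only.
fwd, rev, product = attb_primers("ATGAAACCCGGGTTTTAA", 6, "ACAC", "GTGT")
```

This mirrors the note on primer additions being defined on the sense strand: an addition for the reverse primer is entered as it appears in the product and is reverse complemented when the primer is written out.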
The attB sites, the primer additions and the primer regions are annotated in the final result as
shown in figure 23.41 (you may need to switch on the relevant annotation types to show the
sites and primer additions).
There will be one output sequence for each sequence you have selected for adding attB sites.
Save ( ) the resulting sequence as it will be the input to the next part of the Gateway cloning
workflow (see section 23.5.2).
Figure 23.40: Besides the main output which is a copy of the input sequence(s) now including attB
sites and primer additions, you can get a list of primers as output.
Figure 23.41: The attB site and the Shine-Dalgarno primer addition are annotated.
Name When the sequence fragment is extended with a primer addition, an annotation will be
added displaying this name.
Sequence The actual sequence to be inserted, defined on the sense strand (although the reverse
primer would be reverse complement).
Annotation type The annotation type of the primer that is added to the fragment.
Forward primer addition Whether this addition should be visible in the list of additions for the
forward primer.
Reverse primer addition Whether this addition should be visible in the list of additions for the
reverse primer.
Figure 23.42: Configuring the list of primer additions available when adding attB sites.
sure that the sequence of the destination vector was saved in the Navigation Area: find the
relevant vector sequence on the Thermo Fisher Scientific website, copy it, and paste it
in the field that opens when you choose New | Sequence in the workbench. Fill in additional
information appropriately (enter a "Name", check the "Circular" option) and save the sequence
in the Navigation Area.
Tools | Cloning ( )| Gateway Cloning ( ) | Create Entry Clone ( )
In the first wizard window, select one or more sequences to be recombined into your donor vector.
Note that the sequences you select should be flanked with attB sites (see section 23.5.1). You
can select more than one sequence as input, and the corresponding number of entry clones will
be created.
In the following dialog (figure 23.43), you can specify a donor vector.
Once the vector is selected, a preview of the fragments selected and the attB sites that they
contain is shown. This can be used to get an overview of which entry clones should be used and
check that the right attB sites have been added to the fragments. Also note that the workbench
looks for the attP sites (see how to change the definition of sites in appendix D), but it does not
check that they correspond to the attB sites of the selected fragments at this step. If the right
combination of attB and attP sites is not found, no entry clones will be produced.
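The pairing requirement between attB and attP sites can be pictured with a check on site names. This is only a sketch: the Workbench looks for the site sequences themselves, whereas the hypothetical function below works on names such as "attB1" and assumes that a site recombines only with the attP site carrying the same number, giving the correspondingly numbered attL site in the entry clone.

```python
def bp_reaction_sites(fragment_sites, donor_sites):
    """Return the attL sites of the entry clone, or None if sites do not pair.

    fragment_sites: attB site names on the fragment, e.g. ("attB1", "attB2").
    donor_sites: attP site names on the donor vector, e.g. ("attP1", "attP2").
    """
    products = []
    for site in fragment_sites:
        if not site.startswith("attB"):
            return None
        suffix = site[len("attB"):]
        if "attP" + suffix not in donor_sites:
            # No matching attP site: no entry clone is produced.
            return None
        products.append("attL" + suffix)
    return tuple(products)
```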
The output is one entry clone per sequence selected. The attB and attP sites have been used for
the recombination, and the entry clone is now equipped with attL sites as shown in figure 23.44.
Note that the by-product of the recombination is not part of the output.
In the second step, select the destination vector that was previously saved in the Navigation
Area (figure 23.45).
Note that the workbench looks for the specific sequences of the attR sites in the sequences that
you select in this dialog (see how to change the definition of sites in appendix D), but it does not
check that they correspond to the attL sites of the selected fragments. If the right combination
of attL and attR sites is not found, no expression clones will be produced.
When performing multi-site gateway cloning, CLC Main Workbench will insert the fragments
(contained in entry clones) by matching the sites that are compatible. If the sites have been defined
correctly, an expression clone containing all the fragments will be created. You can find an
explanation of the multi-site gateway system at
https://www.thermofisher.com/dk/en/home/life-science/cloning/gateway-cloning/multisite-gateway-technology.html?SID=fr-gwcloning-3
The output is a number of expression clones depending on how many entry clones and destination
vectors you selected. The attL and attR sites have been used for the recombination, and the
expression clone is now equipped with attB sites as shown in figure 23.46.
You can choose to create a sequence list with the by-products as well.
• When using the Restriction Site Analysis tool, available under the Tools menu, you can
choose to create a restriction map which can be shown as a gel (see section 23.1.2).
• From all the graphical views of sequences, you can right-click the name of the sequence
and choose Digest and Create Restriction Map ( ). The sequence will be digested with
the enzymes that are selected in the Side Panel. The views where this option is available
are listed below:
Information on bands / fragments You can get information about the individual bands by
hovering the mouse cursor on the band of interest. This will display a tool tip with the following
information:
Figure 23.48: Five lanes showing fragments of five sequences cut with restriction enzymes.
• Fragment length
For gels comparing whole sequences, you will see the sequence name and the length of the
sequence.
Note! You have to be in Selection ( ) or Pan ( ) mode in order to get this information.
It can be useful to add markers to the gel, which enables you to compare the sizes of the bands.
This is done by clicking Show marker ladder in the Side Panel.
Markers can be entered into the text field, separated by commas.
Modifying the layout The background of the lane and the colors of the bands can be changed
in the Side Panel. Click the colored box to display a dialog for picking a color. The slider Scale
band spread can be used to adjust the effective time of separation on the gel, i.e. how much
the bands will be spread over the lane. In a real electrophoresis experiment this property will be
determined by several factors including time of separation, voltage and gel density.
You can also choose how many lanes should be displayed:
• Sequences in separate lanes. This simulates that a gel is run for each sequence.
• All sequences in one lane. This simulates that one gel is run for all sequences.
You can also modify the layout of the view by zooming in or out. Click Zoom in ( ) or Zoom out
( ) in the Toolbar and click the view.
Finally, you can modify the format of the text heading each lane in the Text format preferences in
the Side Panel.
Chapter 24
RNA structure
Contents
24.1 RNA secondary structure prediction
24.1.1 Selecting sequences for prediction
24.1.2 Secondary structure prediction parameters
24.1.3 Structure as annotation
24.2 View and edit secondary structures
24.2.1 Graphical view and editing of secondary structure
24.2.2 Tabular view of structures and energy contributions
24.2.3 Symbolic representation in sequence view
24.2.4 Probability-based coloring
24.3 Evaluate structure hypothesis
24.3.1 Selecting sequences for evaluation
24.3.2 Probabilities
24.4 Structure scanning plot
24.4.1 Selecting sequences for scanning
24.4.2 The structure scanning result
24.5 Bioinformatics explained: RNA structure prediction by minimum free energy minimization
24.5.1 The algorithm
24.5.2 Structure elements and their energy contribution
Ribonucleic acid (RNA) is a nucleic acid polymer that plays several important roles in the cell.
As for proteins, the three dimensional shape of an RNA molecule is important for its molecular
function. A number of tertiary RNA structures are known from crystallography but de novo prediction
of tertiary structures is not possible with current methods. However, as for proteins, RNA tertiary
structures can be characterized by secondary structural elements which are hydrogen bonds
within the molecule that form several recognizable "domains" of secondary structure like stems,
hairpin loops, bulges and internal loops. A large part of the functional information is thus
contained in the secondary structure of the RNA molecule, as shown by the high degree of
base-pair conservation observed in the evolution of RNA molecules.
CHAPTER 24. RNA STRUCTURE 539
Computational prediction of RNA secondary structure is a well defined problem and a large body
of work has been done to refine prediction algorithms and to experimentally estimate the relevant
biological parameters.
In CLC Main Workbench we offer the user a number of tools for analyzing and displaying RNA
structures. These include:
Figure 24.1: Selecting RNA or DNA sequences for structure prediction (DNA is folded as if it was
RNA).
Structure output
The predict secondary structure algorithm always calculates the minimum free energy structure
of the input sequence. In addition to this, it is also possible to compute a sample of suboptimal
structures by ticking the checkbox Compute sample of suboptimal structures.
Subsequently, you can specify how many structures to include in the output. The algorithm then
iterates over all permissible canonical base pairs and computes the minimum free energy and
associated secondary structure constrained to contain a specified base pair. These structures
are then sorted by their minimum free energy, and the lowest-energy structures are reported, up to
the specified number of structures. Note that two different suboptimal structures can have the
same minimum free energy. Further information about suboptimal folding can be found in [Zuker,
1989a].
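The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the Workbench implementation: the function `constrained_fold` is a hypothetical callback standing in for the constrained minimum free energy calculation, and treating the G-U wobble pair as canonical and requiring at least three enclosed bases (`MIN_LOOP`) are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]

# Assumption: "canonical" covers Watson-Crick pairs plus the G-U wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of bases enclosed by a pair


def permissible_pairs(seq: str) -> List[Pair]:
    """List every 0-based (i, j) with i < j that could form a canonical pair."""
    rna = seq.upper().replace("T", "U")  # DNA is folded as if it were RNA
    return [
        (i, j)
        for i in range(len(rna))
        for j in range(i + MIN_LOOP + 1, len(rna))
        if (rna[i], rna[j]) in CANONICAL
    ]


def sample_suboptimal(
    seq: str,
    constrained_fold: Callable[[str, Pair], Tuple[float, str]],
    n_structures: int,
) -> List[Tuple[float, str]]:
    """Return up to n_structures (energy, structure) tuples, lowest energy first."""
    best: Dict[str, float] = {}
    for pair in permissible_pairs(seq):
        # Hypothetical callback: best structure constrained to contain `pair`.
        energy, structure = constrained_fold(seq, pair)
        best[structure] = energy  # the same structure may be reached via several pairs
    ranked = sorted(best.items(), key=lambda item: item[1])
    return [(energy, structure) for structure, energy in ranked[:n_structures]]
```

Because the same structure can be the constrained optimum for several base pairs, the sketch reports each distinct structure only once; distinct structures with equal energy keep the order in which they were found.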
Partition function
The predicted minimum free energy structure gives a point-estimate of the structural conformation
of an RNA molecule. However, this procedure implicitly assumes that the secondary structure
is at equilibrium, that there is only a single accessible structure conformation, and that the
parameters and model of the energy calculation are free of errors.
Obvious deviations from these assumptions make it clear that the predicted MFE structure may
deviate somewhat from the actual structure assumed by the molecule. This means that rather
than looking at the MFE structure it may be informative to inspect statistical properties of the
structural landscape to look for general structural properties which seem to be robust to minor
variations in the total free energy of the structure (see [Mathews et al., 2004]).
To this end CLC Main Workbench allows the user to calculate the complete secondary structure
partition function using the algorithm described in [Mathews et al., 2004] which is an extension
of the seminal work by [McCaskill, 1990].
There are two options regarding the partition function calculation:
• Calculate base pair probabilities. This option invokes the partition function calculation and
calculates the marginal probabilities of all possible base pairs and the marginal probability
that any single base is unpaired.
• Create plot of marginal base pairing probabilities. This creates a plot of the marginal base
pair probability of all possible base pairs as shown in figure 24.3.
Figure 24.3: The marginal base pair probability of all possible base pairs.
The marginal probabilities of base pairs and of bases being unpaired are distinguished by colors
which can be displayed in the normal sequence view using the Side Panel - see section 24.2.3
and also in the secondary structure view. An example is shown in figure 24.4. Furthermore, the
marginal probabilities are accessible from tooltips when hovering over the relevant parts of the
structure.
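To make the notion of marginal probabilities concrete, the sketch below computes them for a small, explicitly listed set of structures. This is only an illustration under stated assumptions: the Workbench uses a dynamic programming algorithm over all structures, whereas this example enumerates a hand-written ensemble, and the function names, the dot-bracket input format and the temperature of 37 °C are choices made for the example.

```python
import math
from collections import defaultdict

RT = 0.0019872 * 310.15  # gas constant (kcal/(mol*K)) times an assumed 37 °C


def pairs_of(dot_bracket):
    """Return the set of 0-based (i, j) base pairs in a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def marginal_probabilities(ensemble):
    """ensemble maps dot-bracket structures to free energies in kcal/mol.

    Returns (pair_prob, unpaired_prob): the marginal probability of each base
    pair and, per position, the marginal probability of being unpaired.
    """
    weights = {s: math.exp(-dg / RT) for s, dg in ensemble.items()}
    z = sum(weights.values())  # partition function of this toy ensemble
    length = len(next(iter(ensemble)))
    pair_prob = defaultdict(float)
    unpaired_prob = [0.0] * length
    for structure, weight in weights.items():
        p = weight / z  # Boltzmann probability of this structure
        pairs = pairs_of(structure)
        paired_positions = {pos for pair in pairs for pos in pair}
        for pair in pairs:
            pair_prob[pair] += p
        for pos in range(length):
            if pos not in paired_positions:
                unpaired_prob[pos] += p
    return dict(pair_prob), unpaired_prob
```

For example, `marginal_probabilities({"(...)": -1.0, ".....": 0.0})` gives the pair (0, 4) a probability above 0.5, because the paired structure has the lower free energy.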
Figure 24.4: Marginal probability of base pairs shown in linear view (top) and marginal probability
of being unpaired shown in the secondary structure 2D view (bottom).
Advanced options
The free energy minimization algorithm includes a number of advanced options:
• Avoid isolated base pairs. The algorithm filters out isolated base pairs (i.e. stems of length
1).
• Apply different energy rules for Grossly Asymmetric Interior Loops (GAIL). Compute the
minimum free energy applying different rules for Grossly Asymmetric Interior Loops (GAIL). A
Grossly Asymmetric Interior Loop (GAIL) is an interior loop that is 1 × n or n × 1 where n > 2
(see http://www.unafold.org/doc/mfold-manual/node5.php).
• Include coaxial stacking energy rules. Include free energy increments of coaxial stacking
for adjacent helices [Mathews et al., 2004].
• Apply base pairing constraints. With base pairing constraints, you can easily add
experimental constraints to your folding algorithm. When you are computing suboptimal
structures, it is not possible to apply base pair constraints. The possible base pairing
constraints are:
Base pairing constraints have to be added to the sequence before you can use this option
- see below.
• Maximum distance between paired bases. Forces the algorithm to consider only RNA
structures up to a given length by setting a maximum distance between the two bases of the
base pair that opens a structure.
Using this procedure to add base pairing constraints will force the algorithm to compute minimum
free energy and structure with a stem in the selected region. The two regions must be of equal
length.
To prohibit two regions from forming a stem, open the sequence and:
Select the two regions you want to prohibit by pressing Ctrl while selecting - (use
on Mac) | right-click the selection | Add Structure Prediction Constraints | Prohibit
Stem Here
This will add an annotation labeled "Prohibited Stem" to the sequence (see figure 24.6).
Using this procedure to add base pairing constraints will force the algorithm to compute minimum
free energy and structure without a stem in the selected region. Again, the two selected regions
must be of equal length.
To prohibit a region from being part of any base pair, open the sequence and:
Select the bases you don't want to base pair | right-click the selection | Add
Structure Prediction Constraints | Prohibit From Forming Base Pairs
This will add an annotation labeled "No base pairs" to the sequence (see figure 24.7).
Figure 24.7: Prohibiting any of the selected bases from pairing with other bases.
Using this procedure to add base pairing constraints will force the algorithm to compute minimum
free energy and structure without a base pair containing any residues in the selected region.
When you click Predict secondary structure ( ) and click Next, check Apply base pairing
constraints in order to force or prohibit stem regions or prohibit regions from forming base pairs.
You can add multiple base pairing constraints, e.g. simultaneously adding forced stem regions
and prohibited stem regions and prohibit regions from forming base pairs.
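The three kinds of constraints can be pictured as simple checks on a candidate structure. The sketch below is an illustration only, with hypothetical names (`Region`, `stem_pairs`, `satisfies`). It assumes that a forced stem means every pair of the stem must be present, that a prohibited stem means none of its pairs may be present, and that structures are given in dot-bracket notation with 0-based positions.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

Pair = Tuple[int, int]


@dataclass(frozen=True)
class Region:
    start: int  # 0-based, inclusive
    end: int    # 0-based, inclusive

    def __len__(self) -> int:
        return self.end - self.start + 1


def stem_pairs(upstream: Region, downstream: Region) -> Set[Pair]:
    """Pair the first base of one region with the last base of the other, and so on."""
    if len(upstream) != len(downstream):
        raise ValueError("the two regions must be of equal length")
    return {(upstream.start + k, downstream.end - k) for k in range(len(upstream))}


def pairs_of(dot_bracket: str) -> Set[Pair]:
    """Return the set of base pairs in a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.add((stack.pop(), pos))
    return pairs


def satisfies(
    dot_bracket: str,
    forced: Iterable[Pair] = (),
    prohibited: Iterable[Pair] = (),
    unpaired: Iterable[int] = (),
) -> bool:
    """Check a structure against forced pairs, prohibited pairs and unpaired positions."""
    pairs = pairs_of(dot_bracket)
    paired_positions = {pos for pair in pairs for pos in pair}
    return (
        all(pair in pairs for pair in forced)
        and not any(pair in pairs for pair in prohibited)
        and not any(pos in paired_positions for pos in unpaired)
    )
```

For instance, `stem_pairs(Region(0, 1), Region(5, 6))` yields the pairs (0, 6) and (1, 5), and passing regions of unequal length raises an error, mirroring the equal-length requirement stated above.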
This makes it possible to use the structure information in other analyses in the CLC Main
Workbench. You can e.g. align different sequences and compare their structure predictions.
Note that any existing structure annotations will be removed when a new structure is calculated
and added as annotations.
If you generate multiple structures, only the best structure will be added as annotations. If you
wish to add one of the sub-optimal structures as annotations, this can be done from the Show
Secondary Structure Table ( ) described in section 24.2.2.
• Annotations in the ordinary sequence views (Linear sequence view ( ), Annotation table
( ) etc.). This is only possible if this has been chosen in the dialog in figure 24.2. See an
example in figure 24.8.
• A tabular view of the energy contributions of the elements in the structure. If more than
one structure has been predicted, the table is also used to switch between the structures
shown in the graphical view. The table is described in section 24.2.2.
Figure 24.9: The secondary structure view of an RNA sequence zoomed in.
Like the normal sequence view, you can use Zoom in ( ) and Zoom out ( ). Zooming in will
reveal the residues of the structure as shown in figure 24.9. For large structures, zooming out
will give you an overview of the whole structure.
• Follow structure selection. This setting pertains to the connection between the structures
in the secondary structure table ( ). If this option is checked, the structure displayed in
the secondary structure 2D view will follow the structure selections made in this table. See
section 24.2.2 for more information.
• Layout strategy. Specify the strategy used for the layout of the structure. In addition to
these strategies, you can also modify the layout manually as explained in the next section.
Auto. The layout is adjusted to minimize overlapping structure elements [Han et al.,
1999]. This is the default setting (see figure 24.10).
Proportional. Arc lengths are proportional to the number of residues (see figure 24.11).
Nothing is done to prevent overlap.
Even spread. Stems are spread evenly around loops as shown in figure 24.12.
• Reset layout. If you have manually modified the layout of the structure, clicking this button
will reset the structure to the way it was laid out when it was created.
Figure 24.11: Proportional layout. Length of the arc is proportional to the number of residues in
the arc.
Figure 24.12: Even spread. Stems are spread evenly around loops.
Press down the mouse button where the selection should start | move the mouse
cursor to where the selection should end | release the mouse button
One of the advantages of the secondary structure 2D view is that it is integrated with other views
of the same sequence. This means that any selection made in this view will be reflected in other
views (see figure 24.13).
Figure 24.13: A split view of the secondary structure view and a linear sequence view.
If you make a selection in another sequence view, this will also be reflected in the secondary
structure view.
The CLC Main Workbench seeks to produce a layout of the structure where none of the elements
overlap. However, it may be desirable to manually edit the layout of a structure for ease of
understanding or for the purpose of publication.
To edit a structure, first select the Pan ( ) mode in the Toolbar (right-click on the zoom icon
below the View Area). Now place the mouse cursor on the opening of a stem, and a visual
indication of the anchor point for turning the substructure will be shown (see figure 24.14).
Figure 24.14: The blue circle represents the anchor point for rotating the substructure.
Click and drag to rotate the part of the structure represented by the line going from the anchor
point. In order to keep the bases in a relatively sequential arrangement, there is a restriction
on how much the substructure can be rotated. The highlighted part of the circle represents the
angle where rotating is allowed.
In figure 24.15, the structure shown in figure 24.14 has been modified by dragging with the
mouse.
Press Reset layout in the Side Panel to reset the layout to the way it looked when the structure
was predicted.
• If more than one structure is predicted (see section 24.1), the table provides an overview
of all the structures which have been predicted.
• With multiple structures you can use the table to determine which structure should be
displayed in the Secondary structure 2D view (see section 24.2.1).
• The table contains a hierarchical display of the elements in the structure with detailed
information about each element's energy contribution.
To show the secondary structure table of an already open sequence, click the Show Secondary
Structure Table ( ) button at the bottom of the sequence view.
If the sequence is not open, click Show ( ) and select Secondary Structure Table ( ).
This will open a view similar to the one shown in figure 24.16.
On the left side, all computed structures are listed with the information about structure name,
when the structure was created, the free energy of the structure and the probability of the structure
if the partition function was calculated. Selecting a row (equivalent: a structure) will display a
tree of the contained substructures with their contributions to the total structure free energy.
Each substructure contains a union of nested structure elements and other substructures (see
a detailed description of the different structure elements in section 24.5.2). Each substructure
contributes a free energy given by the sum of its nested substructure energies and energies of
its nested structure elements.
Figure 24.16: The secondary structure table with the list of structures to the left, and to the right
the substructures of the selected structure.
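The additive bookkeeping of the table, where a substructure's free energy is the sum over its nested substructures and structure elements, can be sketched as a small tree. The class name `Node` and all energy values below are made up for illustration and do not come from the Workbench.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A substructure or structure element in the hierarchical table."""
    name: str
    energy: float = 0.0  # own contribution in kcal/mol (illustrative values only)
    children: List["Node"] = field(default_factory=list)

    def total_energy(self) -> float:
        """Own contribution plus the totals of all nested nodes."""
        return self.energy + sum(child.total_energy() for child in self.children)


# Hypothetical tree loosely following the example discussed in the text.
structure = Node("Structure", children=[
    Node("Stem with bifurcation", children=[
        Node("Stem base pairs", -12.5),
        Node("Multi loop", 3.5, children=[
            Node("Stem with hairpin", -4.0),
            Node("Stem with hairpin", -2.5),
            Node("Stem with hairpin", -3.0),
        ]),
    ]),
    Node("Dangling nucleotide", -0.5),
])
```

With these made-up numbers, `structure.total_energy()` adds up every leaf and loop contribution in the tree, which is the same summation the table performs when it reports the energy of an expanded node.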
The substructure elements to the right are ordered after their occurrence in the sequence; they
are described by a region (the sequence positions covered by this substructure) and an energy
contribution. Three examples of mixed substructure elements are "Stem base pairs", "Stem with
bifurcation" and "Stem with hairpin".
The "Stem base pairs"-substructure is simply a union of stacking elements. It is given by a
joined set of base pair positions and an energy contribution displaying the sum of all stacking
element-energies.
The "Stem with bifurcation"-substructure defines a substructure enclosed by a specified base
pair and with energy contribution ∆G. The substructure contains a "Stem base pairs"-substructure
and a nested bifurcated substructure (multi loop). Also bulge and interior loops can
occur separating stem regions.
The "Stem with hairpin"-substructure defines a substructure starting at a specified base pair
with an enclosed substructure-energy given by ∆G. The substructure contains a "Stem base
pairs"-substructure and a hairpin loop. Also bulge and interior loops can occur, separating stem
regions.
In order to describe the tree ordering of different substructures, we use an example as a starting
point (see figure 24.17).
The structure is a (disjoint) nested union of a "Stem with bifurcation"-substructure and a dangling
nucleotide. The nested substructure energies add up to the total energy. The "Stem with
bifurcation"-substructure is again a (disjoint) union of a "Stem base pairs"-substructure joining
position 1-7 with 64-70 and a multi loop structure element opened at base pair (7,64). To see
these structure elements, simply expand the "Stem with bifurcation" node (see figure 24.18).
The multi loop structure element is a union of three "Stem with hairpin"-substructures and
contributions to the multi loop opening considering multi loop base pairs and multi loop arcs.
Selecting an element in the table to the right will make a corresponding selection in the Show
Secondary Structure 2D View ( ) if this is also open and if "Follow structure selection" has
been set in the editor's Side Panel. In figure 24.18 the "Stem with bifurcation" is selected in the
table, and this part of the structure is highlighted in the Secondary Structure 2D view.
Figure 24.17: A split view showing a structure table to the right and the secondary structure 2D
view to the left.
The correspondence between the table and the structure editor makes it easy to inspect the
thermodynamic details of the structure while keeping a visual overview as shown in the above
figures.
Handling multiple structures The table to the left offers a number of tools for working with
structures. Select a structure, right-click, and the following menu items will be available:
• Open Secondary Structure in 2D View ( ). This will open the selected structure in the
Secondary structure 2D view.
• Annotate Sequence with Secondary Structure. This will add the structure elements as
annotations to the sequence. Note that existing structure annotations will be removed.
• Rename Secondary Structure. This will allow you to specify a name for the structure to be
displayed in the table.
• Delete All Secondary Structures. This will delete all the selected structures. Note that
once you save and close the view, this operation is irreversible. As long as the view is
open, you can Undo ( ) the operation.
Figure 24.18: Now the "Stem with bifurcation" node has been selected in the table and a
corresponding selection has been made in the view of the secondary structure to the left.
Figure 24.19: The secondary structure visualized below the sequence and with annotations shown
above.
• Show all structures. If more than one structure is predicted, this option can be used if all
the structures should be displayed.
• Show first. If not all structures are shown, this can be used to determine the number of
structures to be shown.
• Sort by. When you select to display e.g. four out of eight structures, this option determines
which the "first four" should be.
Sort by ∆G.
Sort by name.
Sort by time of creation.
If these three options do not provide enough control, you can rename the structures in a
meaningful alphabetical way so that you can use the "name" to display the desired ones.
• Base pair symbol. How a base pair should be represented (see figure 24.19).
• Unpaired symbol. How bases which are not part of a base pair should be represented (see
figure 24.19).
• Height. When you zoom out, this option determines the height of the symbols as shown in
figure 24.20 (when zoomed in, there is no need for specifying the height).
When you zoom in and out, the appearance of the symbols changes. In figure 24.19, the view
is zoomed in. In figure 24.20 you see the same sequence zoomed out to fit the width of the
sequence.
Figure 24.20: The secondary structure visualized below the sequence and with annotations shown
above. The view is zoomed out to fit the width of the sequence.
For both paired and unpaired bases, you can set the foreground color and the background color
to a gradient with the color at the left side indicating a probability of 0, and the color at the right
side indicating a probability of 1.
Note that you have to Zoom to 100% ( ) in order to see the coloring.
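The mapping from a probability to a color on such a gradient can be illustrated with a short sketch. Linear interpolation between the two end colors in RGB space is an assumption made here for illustration; the manual does not state how the Workbench interpolates, and the function name `gradient_color` is hypothetical.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def gradient_color(probability: float, left: RGB, right: RGB) -> RGB:
    """Blend linearly from `left` (probability 0) to `right` (probability 1)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return tuple(
        round(low + (high - low) * probability) for low, high in zip(left, right)
    )
```

A base pair with probability 0.5 on a gradient from (0, 0, 0) to (200, 100, 50) would then be drawn in (100, 50, 25).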
\[
P(H) = \frac{\sum_{s_H \in S_H} P(s_H)}{\sum_{s \in S} P(s)} = \frac{PF_H}{PF_{\mathrm{full}}},
\]
where $PF_H$ is the partition function calculated for all structures permissible by $H$ ($S_H$) and
$PF_{\mathrm{full}}$ is the full partition function. Calculating the probability can thus be done with
two passes of the partition function calculation, one with structural constraints, and one without.
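The ratio of the two partition functions can be illustrated on a toy ensemble. The sketch below is not the Workbench algorithm: it sums Boltzmann weights over an explicitly listed set of structures instead of running the dynamic programming calculation twice, and the names `hypothesis_probability` and `is_permitted` as well as the temperature of 37 °C are assumptions made for the example.

```python
import math
from typing import Callable, Dict, Iterable

RT = 0.0019872 * 310.15  # gas constant (kcal/(mol*K)) times an assumed 37 °C


def partition_function(energies: Iterable[float]) -> float:
    """Sum of Boltzmann weights exp(-dG / RT) over the given free energies."""
    return sum(math.exp(-dg / RT) for dg in energies)


def hypothesis_probability(
    ensemble: Dict[str, float],
    is_permitted: Callable[[str], bool],
) -> float:
    """P(H) = PF_H / PF_full for a dict of dot-bracket structure -> free energy."""
    pf_full = partition_function(ensemble.values())
    pf_h = partition_function(
        dg for structure, dg in ensemble.items() if is_permitted(structure)
    )
    return pf_h / pf_full
```

With three equally stable structures of which one is consistent with the hypothesis, the function returns 1/3; lowering the free energy of the consistent structure raises the probability.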
• Avoid isolated base pairs. The algorithm filters out isolated base pairs (i.e. stems of length
1).
Figure 24.22: Selecting RNA or DNA sequences for evaluating structure hypothesis.
• Apply different energy rules for Grossly Asymmetric Interior Loops (GAIL). Compute the
minimum free energy applying different rules for Grossly Asymmetric Interior Loops (GAIL). A
Grossly Asymmetric Interior Loop (GAIL) is an interior loop that is 1 × n or n × 1 where n > 2
(see http://mfold.rna.albany.edu/doc/mfold-manual/node5.php).
• Include coaxial stacking energy rules. Include free energy increments of coaxial stacking
for adjacent helices [Mathews et al., 2004].
24.3.2 Probabilities
After evaluation of the structure hypothesis an annotation is added to the input sequence.
This annotation covers the same region as the annotations that constituted the hypothesis and
contains information about the probability of the evaluated hypothesis (see figure 24.24).
Figure 24.24: This hypothesis has a probability of 0.338 as shown in the annotation.
If you have selected sequences before running the tool, those sequences will be listed in the
Selected Elements pane of the dialog. Use the arrows to add or remove sequences or sequence
lists from the selected elements.
Click Next to adjust scanning parameters (see figure 24.26).
The first group of parameters pertains to the methods of sequence resampling. There are four
ways of resampling, all described in detail in [Clote et al., 2005]:
• Dinucleotide shuffling. Shuffle method generating a sequence of the exact same
dinucleotide frequency.
• Mononucleotide sampling from zero order Markov chain. Resampling method generating
a sequence of the same expected mononucleotide frequency.
• Dinucleotide sampling from first order Markov chain. Resampling method generating a
sequence of the same expected dinucleotide frequency.
The second group of parameters pertains to the scanning settings and includes:
• Number of samples. The number of times the sequence is resampled to produce the
background distribution.
• Step increment. Step increment when plotting sequence positions against scoring values.
• P-values. Create a plot of the statistical significance of the structure signal as a function
of sequence position.
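How resampling and scanning fit together can be sketched as follows. This is an illustration under stated assumptions rather than the Workbench implementation: `score` is a hypothetical stand-in for the folding free energy of a window, only mononucleotide sampling from a zero order Markov chain is shown, and the window size, function names and fixed random seed are choices made for the example.

```python
import random
import statistics
from typing import Callable, List, Tuple


def markov0_sample(seq: str, rng: random.Random) -> str:
    """Draw each position independently from the mononucleotide frequencies of seq."""
    return "".join(rng.choices(seq, k=len(seq)))


def z_score(
    seq: str, score: Callable[[str], float], n_samples: int, rng: random.Random
) -> float:
    """Compare the observed score with a background of resampled sequences."""
    observed = score(seq)
    background = [score(markov0_sample(seq, rng)) for _ in range(n_samples)]
    sd = statistics.stdev(background)  # needs n_samples >= 2
    if sd == 0:
        return 0.0  # no variation in the background, so no signal can be measured
    return (observed - statistics.fmean(background)) / sd


def scan(
    seq: str,
    window: int,
    step: int,
    score: Callable[[str], float],
    n_samples: int,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Slide a window along seq and return (start position, Z-score) pairs."""
    rng = random.Random(seed)
    return [
        (start, z_score(seq[start:start + window], score, n_samples, rng))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

In this sketch, a larger number of samples gives a smoother background distribution at the cost of more score evaluations, and a larger step increment reduces the number of windows that are evaluated.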
Figure 24.27: A plot of the Z-scores produced by sliding a window along a sequence.
paper is to describe a very popular way of doing this, namely free energy minimization. For an
in-depth review of algorithmic details, we refer the reader to [Mathews and Turner, 2006].
Suboptimal structures determination A number of known factors violate the assumptions that
are implicit in MFE structure prediction. [Schroeder et al., 1999] and [Chen et al., 2004] have
shown experimental indications that the thermodynamic parameters are sequence dependent.
Moreover, [Longfellow et al., 1990] and [Kierzek et al., 1999] have demonstrated that some
structural elements show non-nearest neighbor effects. Finally, single stranded nucleotides in
multi loops are known to influence stability [Mathews and Turner, 2002].
These phenomena can be expected to limit the accuracy of RNA secondary structure prediction
by free energy minimization and it should be clear that the predicted MFE structure may deviate
somewhat from the actual preferred structure of the molecule. This means that it may be
informative to inspect the landscape of suboptimal structures which surround the MFE structure
to look for general structural properties which seem to be robust to minor variations in the total
free energy of the structure.
An effective procedure for generating a sample of suboptimal structures is given in [Zuker,
1989a]. This algorithm works by going through all possible Watson-Crick base pairs in the
molecule. For each of these base pairs, the algorithm computes the most optimal structure
among all the structures that contain this pair, see figure 24.28.
Figure 24.28: A number of suboptimal structures have been predicted using CLC Main Workbench
and are listed at the top left. At the right hand side, the structural components of the selected
structure are listed in a hierarchical structure and on the left hand side the structure is displayed.
Figure 24.29: The different structure elements of RNA secondary structures predicted with the free
energy minimization algorithm in CLC Main Workbench. See text for a detailed description.
Nested structure elements The structure elements involving nested base pairs can be classified
by a given base pair and the other base pairs that are nested and accessible from this pair. For a
more elaborate description we refer the reader to [Sankoff et al., 1983] and [Zuker and Sankoff,
1984].
If the nucleotides with position number (i, j) form a base pair and i < k, l < j, then we say that
the base pair (k, l) is accessible from (i, j) if there is no intermediate base pair (i', j') such that
i < i' < k, l < j' < j. This means that (k, l) is nested within the pair (i, j) and there is no other
base pair in between.
Using the number of accessible base pairs, we can define the following distinct structure
elements:
1. Hairpin loop ( ). A base pair with 0 other accessible base pairs forms a hairpin loop. The
energy contribution of a hairpin is determined by the length of the unpaired (loop) region
and the two bases adjacent to the closing base pair which is termed a terminal mismatch
(see figure 24.29A).
2. A base pair with 1 accessible base pair can give rise to three distinct structure elements:
3. Multi loop opened ( ). A base pair with two or more accessible base pairs gives rise
to a multi loop, a loop from which three or more stems are opened (see figure 24.29E). The
energy contribution of a multi loop depends on the number of Stems opened in multi-loop
( ) that protrude from the loop.
• A collection of single stranded bases not accessible from any base pair is called an exterior
(or external) loop (see figure 24.29F). These regions do not contribute to the total free
energy.
• Non-GC terminating stem ( ). If a base pair other than a G-C pair is found at the end of
a stem, an energy penalty is assigned (see figure 24.29H).
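The definition of accessible base pairs, and the way nested structure elements are told apart by counting them, can be sketched in a few lines. This is an illustration only. It assumes dot-bracket input with 0-based positions, follows the usual convention that two or more accessible pairs make a multi loop, and splits the single-accessible-pair case into stacking pair, bulge and interior loop by the number of unpaired bases on each side, which is the standard classification rather than something quoted from this manual.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def pairs_of(dot_bracket: str) -> List[Pair]:
    """Return the sorted list of 0-based (i, j) base pairs in a dot-bracket string."""
    stack, pairs = [], []
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def accessible_pairs(pairs: List[Pair], closing: Pair) -> List[Pair]:
    """Pairs (k, l) nested in `closing` with no intermediate pair (i', j') around them."""
    i, j = closing
    nested = [(k, l) for (k, l) in pairs if i < k and l < j]
    return sorted(
        (k, l)
        for (k, l) in nested
        if not any(i2 < k and l < j2 for (i2, j2) in nested)
    )


def classify(pairs: List[Pair], closing: Pair) -> str:
    """Name the structure element closed by `closing`."""
    i, j = closing
    accessible = accessible_pairs(pairs, closing)
    if not accessible:
        return "hairpin loop"
    if len(accessible) >= 2:
        return "multi loop"
    k, l = accessible[0]
    unpaired_5, unpaired_3 = k - i - 1, j - l - 1
    if unpaired_5 == 0 and unpaired_3 == 0:
        return "stacking pair"
    if unpaired_5 == 0 or unpaired_3 == 0:
        return "bulge"
    return "interior loop"
```

For the structure `((...)(...))`, the outer pair has two accessible pairs and closes a multi loop, while each inner pair has none and closes a hairpin loop.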
Experimental constraints A number of techniques are available for probing RNA structures.
These techniques can determine individual components of an existing structure such as the
existence of a given base pair. It is possible to add such experimental constraints to the
secondary structure prediction based on free energy minimization (see figure 24.30) and it
has been shown that this can dramatically increase the fidelity of the secondary structure
prediction [Mathews and Turner, 2006].
Figure 24.30: Known structural features can be added as constraints to the secondary structure
prediction algorithm in CLC Main Workbench.
Chapter 25
Expression analysis
Contents
25.1 Experimental design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
25.1.1 Setting up a microarray experiment . . . . . . . . . . . . . . . . . . . . . 564
25.1.2 Organization of the experiment table . . . . . . . . . . . . . . . . . . . . 567
25.1.3 Adding annotations to an experiment . . . . . . . . . . . . . . . . . . . . 573
25.1.4 Scatter plot view of an experiment . . . . . . . . . . . . . . . . . . . . . 574
25.1.5 Cross-view selections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
25.2 Transformation and normalization . . . . . . . . . . . . . . . . . . . . . . . . 577
25.2.1 Selecting transformed and normalized values for analysis . . . . . . . . 577
25.2.2 Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
25.2.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
25.3 Quality control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
25.3.1 Create Box Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
25.3.2 Hierarchical Clustering of Samples . . . . . . . . . . . . . . . . . . . . . 584
25.3.3 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . 589
25.4 Feature clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
25.4.1 Hierarchical clustering of features . . . . . . . . . . . . . . . . . . . . . 592
25.4.2 K-means/medoids clustering . . . . . . . . . . . . . . . . . . . . . . . . 596
25.5 Statistical analysis - identifying differential expression . . . . . . . . . . . . 600
25.5.1 Tests on proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
25.5.2 Gaussian-based tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
25.5.3 Corrected p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
25.5.4 Volcano plots - inspecting the result of the statistical analysis . . . . . . 605
25.6 Annotation tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
25.6.1 Hypergeometric Tests on Annotations . . . . . . . . . . . . . . . . . . . 607
25.6.2 Gene Set Enrichment Analysis . . . . . . . . . . . . . . . . . . . . . . . 609
25.7 General plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
25.7.1 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
25.7.2 MA plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
25.7.3 Scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
CHAPTER 25. EXPRESSION ANALYSIS 564
This section focuses on analysing expression data from sources such as microarrays using tools
found under the Expression Analysis ( ) folder of the Tools menu. This includes tools for
performing quality control of the data, transformation and normalization, statistical analysis to
measure differential expression and annotation-based tests. A number of visualization tools such
as volcano plots, MA plots, scatter plots, box plots, and heat maps are also available to aid
interpretation of the results.
Tools for analysing RNA-Seq and small RNA data are available in the CLC Genomics Workbench.
Importing expression data into the Workbench as samples is described in appendix section H.
The first step towards analyzing this expression data is to create an Experiment, which contains
information about which samples belong to which groups.
• Experiment. At the top you can select a two-group experiment, and below you can select a
multi-group experiment and define the number of groups.
Note that you can also specify if the samples are paired. Pairing is relevant if you
have samples from the same individual under different conditions, e.g. before and after
treatment, or at times 0, 2, and 4 hours after treatment. In this case statistical analysis
becomes more efficient if effects of the individuals are taken into account, and comparisons
are carried out not simply by considering raw group means but by considering these corrected
for effects of the individual. If Paired is selected, a paired rather than a standard t-test
will be carried out for two group comparisons. For multiple group comparisons a repeated
measures rather than a standard ANOVA will be used.
Figure 25.1: Select the samples to use for setting up the experiment.
Figure 25.2: Defining the number of groups and expression value type.
• Expression values. If you choose to Set new expression value you can choose between
the following options depending on whether you look at the gene or transcript level:
Genes: Unique exon reads. The number of reads that match uniquely to the exons
(including the exon-exon and exon-intron junctions).
Genes: Unique gene reads. This is the number of reads that match uniquely to the
gene.
Genes: Total exon reads. Number of reads mapped to this gene that fall entirely
within an exon or in exon-exon or exon-intron junctions. As for the "Total gene reads"
this includes both uniquely mapped reads and reads with multiple matches that were
assigned to an exon of this gene.
Genes: Total gene reads. This is all the reads that are mapped to this gene, i.e., both
reads that map uniquely to the gene and reads that matched to more positions in the
reference (but fewer than the "Maximum number of hits for a read" parameter) which
were assigned to this gene.
Genes: RPKM. This is the expression value measured in RPKM [Mortazavi et al., 2008]:
$\mathrm{RPKM} = \frac{\text{total exon reads}}{\text{mapped reads (millions)} \times \text{exon length (KB)}}$.
See exact definition below. Even if
you have chosen the RPKM values to be used in the Expression values column, they
will also be stored in a separate column. This is useful to store the RPKM if you switch
the expression measure.
Transcripts: Unique transcript reads. This is the number of reads in the mapping
for the gene that are uniquely assignable to the transcript. This number is calculated
after the reads have been mapped and both single and multi-hit reads from the read
mapping may be unique transcript reads.
Transcripts: Total transcript reads. Once the "Unique transcript reads" have been
identified and their counts calculated for each transcript, the remaining (non-unique)
transcript reads are assigned randomly to one of the transcripts to which they match.
The "Total transcript reads" counts are the total number of reads that are assigned
to the transcript once this random assignment has been done. As for the random
assignment of reads among genes, the random assignment of reads within a gene but
among transcripts, is done proportionally to the "unique transcript counts" normalized
by transcript length, that is, using the RPKM. Unique transcript counts of 0 are not
replaced by 1 for this proportional assignment of non-unique reads among transcripts.
Transcripts: RPKM. The RPKM value for the transcript, that is, the number of reads
assigned to the transcript divided by the transcript length and normalized by "Mapped
reads" (see below).
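The RPKM definition given above can be sketched in a few lines of Python. This is only an illustration of the formula, not the Workbench's implementation; the function name `rpkm`, its parameter names and the example numbers are hypothetical.

```python
def rpkm(total_exon_reads, mapped_reads, exon_length_bp):
    """RPKM = total exon reads / (mapped reads in millions * exon length in kb)."""
    mapped_millions = mapped_reads / 1_000_000
    exon_length_kb = exon_length_bp / 1_000
    return total_exon_reads / (mapped_millions * exon_length_kb)


# Hypothetical gene: 500 reads on 2,000 bp of exon sequence,
# in a sample with 10 million mapped reads.
example = rpkm(total_exon_reads=500, mapped_reads=10_000_000, exon_length_bp=2_000)
print(example)  # 25.0
```

Dividing by both the exon length and the number of mapped reads is what makes values comparable between genes of different length and between samples of different sequencing depth.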
Depending on the number of groups selected in figure 25.2, you will see a list of groups with text
fields where you can enter an appropriate name for that group.
For multi-group experiments, if you find out that you have too many groups, click the Delete
button. If you need more groups, simply click Add New Group.
Click Next when you have named the groups, and you will see figure 25.4.
This is where you define which group the individual sample belongs to. Simply select one or
more samples (by clicking and dragging the mouse), right-click (Ctrl-click on Mac) and select the
appropriate group.
Note that the samples are sorted alphabetically based on their names.
If you have chosen Paired in figure 25.2, there will be an extra column where you define which
samples belong together. Just as when defining the group membership, you select one or more
samples, right-click in the pairing column and select a pair.
Click Finish to start the tool.
Whenever you perform analyses like normalization, transformation, statistical analysis etc., new
columns will be added to the experiment. You can at any time Export all the data in the
experiment in csv or Excel format or Copy the full table or parts of it.
Column width
There are two options to specify the width of the columns and also the entire table:
• Automatic. This will fit the entire table into the width of the view. This is useful if you only
have a few columns.
• Manual. This will adjust the width of all columns evenly, and it will make the table as wide
as it needs to be to display all the columns. This is useful if you have many columns. In
this case there will be a scroll bar at the bottom, and you can manually adjust the width by
dragging the column separators.
Experiment level
The rest of the Side Panel is devoted to different levels of information on the values in the
experiment. The experiment part contains a number of columns that, for each feature ID, provide
summaries of the values across all the samples in the experiment (see figure 25.6).
Figure 25.6: The initial view of the experiment level for a two-group experiment.
• Range (original values). The 'Range' column contains the difference between the highest
and the lowest expression value for the feature over all the samples. If a feature has the
value NaN in one or more of the samples the range value is NaN.
• IQR (original values). The 'IQR' column contains the interquartile range of the values for a
feature across the samples, that is, the difference between the 75th percentile value and the
25th percentile value. For the IQR values, only the numeric values are considered when percentiles
are calculated (that is, NaN and +Inf or -Inf values are ignored), and if there are fewer than
four samples with numeric values for a feature, the IQR is set to be the difference between
the highest and lowest of these.
• Difference (original values). For a two-group experiment the 'Difference' column contains
the difference between the mean of the expression values across the samples assigned to
group 2 and the mean of the expression values across the samples assigned to group 1.
Thus, if the mean expression level in group 2 is higher than that of group 1 the 'Difference'
is positive, and if it is lower the 'Difference' is negative. For experiments with more than
two groups the 'Difference' contains the difference between the maximum and minimum of
the mean expression values of the groups, multiplied by -1 if the group with the maximum
mean expression value occurs before the group with the minimum mean expression value
(with the ordering: group 1, group 2, ...).
• Fold Change (original values). For a two-group experiment the 'Fold Change' tells you how
many times bigger the mean expression value in group 2 is relative to that of group 1.
If the mean expression value in group 2 is bigger than that in group 1 this value is the
mean expression value in group 2 divided by that in group 1. If the mean expression value
in group 2 is smaller than that in group 1 the fold change is the mean expression value
in group 1 divided by that in group 2 with a negative sign. Thus, if the mean expression
levels in group 1 and group 2 are 10 and 50 respectively, the fold change is 5, and if the
mean expression levels in group 1 and group 2 are 50 and 10 respectively, the
fold change is -5. Entries of plus or minus infinity in the 'Fold Change' columns of the
Experiment area represent those where one of the expression values in the calculation is
a 0. For experiments with more than two groups, the 'Fold Change' column contains the
ratio of the maximum of the mean expression values of the groups to the minimum of the
mean expression values of the groups, multiplied by -1 if the group with the maximum mean
expression value occurs before the group with the minimum mean expression value (with
the ordering: group 1, group 2, ...).
Thus, the sign of the values in the 'Difference' and 'Fold Change' columns gives the direction of
the trend across the groups, going from group 1 to group 2, etc.
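The experiment-level columns for a two-group experiment can be illustrated with a short Python sketch. It follows the descriptions above but is not the Workbench's code. All function names are hypothetical, and a few details that the text does not specify are assumptions: the percentile interpolation method used for the IQR, the result for a feature with no numeric values, and the handling of equal group means or of two means that are both 0.

```python
import math
from statistics import mean, quantiles


def value_range(values):
    """'Range': highest minus lowest value; NaN if any sample is NaN."""
    if any(math.isnan(v) for v in values):
        return math.nan
    return max(values) - min(values)


def iqr(values):
    """'IQR': 75th minus 25th percentile of the numeric values only."""
    nums = [v for v in values if math.isfinite(v)]  # ignore NaN, +Inf, -Inf
    if not nums:
        return math.nan  # assumption: nothing numeric to summarize
    if len(nums) < 4:
        # Fewer than four numeric values: highest minus lowest of these.
        return max(nums) - min(nums)
    # Assumption: linear interpolation between data points.
    q1, _, q3 = quantiles(nums, n=4, method="inclusive")
    return q3 - q1


def difference(group1, group2):
    """'Difference': mean of group 2 minus mean of group 1."""
    return mean(group2) - mean(group1)


def fold_change(group1, group2):
    """'Fold Change': signed ratio of the group means (non-negative values)."""
    m1, m2 = mean(group1), mean(group2)
    if m2 >= m1:  # assumption: equal means give a fold change of 1
        return math.inf if m1 == 0 else m2 / m1
    return -math.inf if m2 == 0 else -(m1 / m2)


group1 = [10.0, 10.0]  # hypothetical expression values, group 1
group2 = [40.0, 60.0]  # hypothetical expression values, group 2
print(difference(group1, group2))   # 40.0
print(fold_change(group1, group2))  # 5.0
print(fold_change(group2, group1))  # -5.0
```

The last two lines reproduce the example from the text: group means of 10 and 50 give a fold change of 5, and swapping the groups gives -5.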
If the samples used are Affymetrix GeneChips samples and have 'Present calls' there will also
be a 'Total present count' column containing the number of present calls for all samples.
The columns under the 'Experiment' header are useful for filtering purposes. For example, you may
wish to ignore features that differ too little in expression levels to be confirmed, e.g. by qPCR,
by filtering on the values in the 'Difference', 'IQR' or 'Fold Change' columns, or you may wish to
ignore features that do not differ at all by filtering on the 'Range' column.
If you have performed normalization or transformation (see sections 25.2.3 and 25.2.2,
respectively), the IQR of the normalized and transformed values will also appear. Also, if you later
choose to transform or normalize your experiment, columns will be added for the transformed or
normalized values.
Note! It is very common to filter features on fold change values in expression analysis and fold
change values are also used in volcano plots, see section 25.5.4. There are different definitions
of 'Fold Change' in the literature. The definition that is used typically depends on the original
scale of the data that is analyzed. For data whose original scale is not the log scale the standard
definition is the ratio of the group means [Tusher et al., 2001]. This is the value you find in
the 'Fold Change' column of the experiment. However, for data whose original scale is the log scale,
the difference of the mean expression levels is sometimes referred to as the fold change [Guo
et al., 2006], and if you want to filter on fold change for these data you should filter on the
values in the 'Difference' column. Your data's original scale will e.g. be the log scale if you have
imported Affymetrix expression values which have been created by running the RMA algorithm on
the probe-intensities.
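The relation between the two definitions can be made concrete with a small Python sketch. It uses hypothetical values and assumes base-2 logarithms. A difference d between group means on the log2 scale corresponds to a ratio of 2^d between the geometric means on the original scale, which is in general not the same number as the ratio of the arithmetic group means.

```python
import math
from statistics import mean

group1 = [100.0, 400.0]    # hypothetical original-scale values, group 1
group2 = [200.0, 3200.0]   # hypothetical original-scale values, group 2

# 'Fold Change' style value: ratio of the group means on the original scale.
ratio_of_means = mean(group2) / mean(group1)

# 'Difference' style value: difference of the group means on the log2 scale.
log_group1 = [math.log2(v) for v in group1]
log_group2 = [math.log2(v) for v in group2]
log_difference = mean(log_group2) - mean(log_group1)

print(ratio_of_means)       # about 6.8
print(log_difference)       # about 2
print(2 ** log_difference)  # about 4, the ratio of the geometric means
```

This is why filtering on 'Difference' is the natural choice for log-scale data: a threshold of 1 on the log2 difference plays the role of a two-fold change.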
Analysis level
The results of each statistical test performed are in the columns listed in this area. In the table,
a heading is given for each test. Information about the results of statistical tests is described
in the statistical analysis section (see section 25.5).
An example of Analysis level settings is shown in figure 25.7.
Figure 25.7: An example of columns available under the Analysis level section.
Note: Some column names here are the same as ones under the Experiment level, but the results
here are from statistical tests, while those under the Experiment level section are calculations
carried out directly on the expression levels.
Annotation level
If your experiment is annotated (see section 25.1.3), the annotations will be listed in the
Annotation level group as shown in figure 25.8.
In order to avoid too much detail and cluttering the table, only a few of the columns are shown
per default.
Note that if you wish a different set of annotations to be displayed each time you open an
experiment, you need to save the settings of the Side Panel (see section 4.6).
Group level
At the group level, you can show/hide entire groups (Heart and Diaphragm in figure 25.5). This
will show/hide everything under the group's header. Furthermore, you can show/hide group-level
information like the group means and present count within a group. If you have performed
normalization or transformation (see sections 25.2.3 and 25.2.2, respectively), the means of
the normalized and transformed values will also appear.
An example is shown in figure 25.9.
Sample level
In this part of the side panel, you can control which columns are displayed for each sample.
Initially this is all the columns in the samples.
If you have performed normalization or transformation (see sections 25.2.3 and 25.2.2,
respectively), the normalized and transformed values will also appear.
Figure 25.10: Sample level when transformation and normalization have been performed.
Figure 25.11: Create a subset of the experiment by clicking the button at the bottom of the
experiment table.
This will create a new experiment that has the same information as the existing one but with fewer
features.
Figure 25.13: Adding annotations by clicking the button at the bottom of the experiment table.
This will bring up a dialog where you can select the annotation file that you have imported
together with the experiment you wish to annotate. Click Next to specify settings as shown in
figure 25.14.
In this dialog, you can specify how to match the annotations to the features in the sample. The
Workbench looks at the columns in the annotation file and lets you choose which column
should be used for matching to the feature IDs in the experimental data (experiment or sample)
as well as for the annotations. Usually the default is right, but for some annotation files, you
need to select another column.
Some annotation files have leading zeros in the identifier which you can remove by checking the
Remove leading zeros box.
Note! Existing annotations on the experiment will be overwritten.
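The effect of the Remove leading zeros option on matching can be pictured with the following Python sketch. It is a simplified illustration under stated assumptions, not a description of how the Workbench matches internally: the helper names, the column names and the identifiers are all hypothetical.

```python
def normalize_id(identifier, remove_leading_zeros):
    """Optionally strip leading zeros from an identifier (hypothetical helper)."""
    if not remove_leading_zeros:
        return identifier
    # Assumption: an all-zero identifier collapses to a single "0".
    return identifier.lstrip("0") or "0"


def match_annotations(feature_ids, annotation_rows, id_column,
                      remove_leading_zeros=False):
    """Map each feature ID to its annotation row, or None if nothing matches."""
    lookup = {
        normalize_id(row[id_column], remove_leading_zeros): row
        for row in annotation_rows
    }
    return {feature_id: lookup.get(feature_id) for feature_id in feature_ids}


# Hypothetical annotation file rows and feature IDs.
annotation_rows = [
    {"Probe ID": "001234", "Gene symbol": "GENE_A"},
    {"Probe ID": "000077", "Gene symbol": "GENE_B"},
]
feature_ids = ["1234", "77"]

unmatched = match_annotations(feature_ids, annotation_rows, "Probe ID")
matched = match_annotations(feature_ids, annotation_rows, "Probe ID",
                            remove_leading_zeros=True)
print(unmatched["1234"])               # None
print(matched["1234"]["Gene symbol"])  # GENE_A
```

Without stripping the zeros no feature finds its annotation, which is the situation the option is meant to resolve.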
One of the views is the Scatter Plot. The scatter plot can be adjusted to show e.g. the
group means for two groups (see more about how to adjust this below).
An example of a scatter plot is shown in figure 25.16.
Figure 25.16: A scatter plot of group means for two groups (transformed expression values).
In the Side Panel to the left, there are a number of options to adjust this view. Under Graph
preferences, you can adjust the general properties of the scatter plot:
• Lock axes This will always show the axes even though the plot is zoomed to a detailed
level.
• Tick type Determine whether tick lines should be shown outside or inside the frame.
• Tick lines at Choosing Major ticks will show a grid behind the graph.
• Horizontal axis range Sets the range of the horizontal axis (x axis). Enter a value in Min
and Max, and press Enter. This will update the view. If you wait a few seconds without
pressing Enter, the view will also be updated.
• Vertical axis range Sets the range of the vertical axis (y axis). Enter a value in Min and
Max, and press Enter. This will update the view. If you wait a few seconds without pressing
Enter, the view will also be updated.
• Draw x = y axis. This will draw a diagonal line across the plot. This line is shown per
default.
• Show Pearson correlation When checked, the Pearson correlation coefficient (r) is displayed
on the plot.
Below the general preferences, you find the Dot properties preferences, where you can adjust
coloring and appearance of the dots:
• Dot type Can be None, Cross, Plus, Square, Diamond, Circle, Triangle, Reverse triangle, or
Dot.
Finally, the group at the bottom - Values to plot - is where you choose the values to be displayed
in the graph. The default for a two-group experiment is to plot the group means.
Note that if you wish to use the same settings next time you open a scatter plot, you need to
save the settings of the Side Panel (see section 4.6).
Besides the Experiment table, which is the default view, the views are: Scatter plot,
Volcano plot and the Heat map. By pressing and holding the Ctrl (⌘ on Mac) button
while you click one of the view buttons in figure 25.17, you can make a split view. This will make
it possible to see e.g. the experiment table in one view and the volcano plot in another view.
An example of such a split view is shown in figure 25.18.
Figure 25.18: A split view showing an experiment table at the top and a volcano plot at the bottom
(note that you need to perform statistical analysis to show a volcano plot, see section 25.5).
Selections are shared between all these different views of an experiment. This means that if you
select a number of rows in the table, the corresponding dots in the scatter plot, volcano plot or
heat map will also be selected. The selection can be made in any view, including the heat map, and
all other open views will reflect the selection.
A common use of the split views is where you have an experiment and have performed a statistical
analysis. You filter the experiment to identify all genes that have an FDR corrected p-value below
0.05 and a fold change for the test above, say, 2. You can select all the rows in the experiment
table satisfying these filters by holding down the Ctrl key and pressing 'a'. If you have a split
view of the experiment and the volcano plot all points in the volcano plot corresponding to the
selected features will be red. Note that the volcano plot allows two sets of values in the columns
under the test you are considering to be displayed on the x-axis: the 'Fold change' values and the
'Difference' values. You control which to plot in the side panel. If you have filtered on 'Fold change' you
will typically want to choose 'Fold change' in the side panel. If you have filtered on 'Difference'
(e.g. because your original data is on the log scale, see the note on fold change in 25.1.2) you
typically want to choose 'Difference'.
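The filter described above can be expressed as a short Python sketch, useful for example when working on an exported copy of the experiment table. The column names and values are hypothetical. Applying the threshold to the absolute fold change is an assumption made here because the 'Fold change' values are signed.

```python
# Hypothetical rows from an exported experiment table.
rows = [
    {"feature": "gene_a", "fdr_p_value": 0.010, "fold_change": 3.2},
    {"feature": "gene_b", "fdr_p_value": 0.200, "fold_change": 5.0},
    {"feature": "gene_c", "fdr_p_value": 0.030, "fold_change": -2.5},
    {"feature": "gene_d", "fdr_p_value": 0.001, "fold_change": 1.4},
]


def passes_filter(row, max_fdr=0.05, min_abs_fold_change=2.0):
    """Keep features with a small FDR corrected p-value and a large fold change."""
    return (row["fdr_p_value"] < max_fdr
            and abs(row["fold_change"]) > min_abs_fold_change)


selected = [row["feature"] for row in rows if passes_filter(row)]
print(selected)  # ['gene_a', 'gene_c']
```

Features that pass both conditions are the ones that would appear selected in the upper left and upper right regions of a volcano plot.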
Figure 25.19: Selecting which version of the expression values to analyze. In this case, the values
have not been normalized, so it is not possible to select normalized values.
25.2.2 Transformation
The CLC Main Workbench lets you transform expression values based on logarithm and adding a
constant:
Tools | Expression Analysis | Transformation and Normalization | Transform
Select a number of samples or an experiment and click Next.
At the top, you can select which values to transform (see section 25.2.1).
Next, you can choose three kinds of transformation:
10.
2.
Natural logarithm.
25.2.3 Normalization
The CLC Main Workbench lets you normalize expression values.
To start the normalization:
Tools | Expression Analysis | Transformation and Normalization | Normalize
Select a number of samples or an experiment and click Next.
This will display a dialog as shown in figure 25.21.
At the top, you can choose three kinds of normalization (for mathematical descriptions see
[Bolstad et al., 2003]):
• Scaling. The sets of the expression values for the samples will be multiplied by a constant
so that the sets of normalized values for the samples have the same 'target' value (see
description of the Normalization value below).
• Quantile. The empirical distributions of the sets of expression values for the samples are
used to calculate a common target distribution, which is used to calculate normalized sets
of expression values for the samples.
• By totals. This option is intended to be used with count-based data, i.e. data from small
RNA or expression profiling by tags. A sum is calculated for the expression values in a
sample. The transformed values are generated by dividing the input values by the sample
sum and multiplying by the factor (e.g. per '1,000,000').
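The three kinds of normalization can be sketched in Python as follows. This is a simplified illustration of the ideas, not the Workbench's implementation (see [Bolstad et al., 2003] for the mathematical descriptions). The function names are hypothetical, the scaling sketch assumes the mean as normalization value with the median of the sample means as target, and the quantile sketch assumes samples of equal length and breaks ties by input order.

```python
from statistics import mean, median


def scale_normalize(samples, target=None):
    """Multiply each sample by a constant so that all sample means equal the target."""
    means = {name: mean(values) for name, values in samples.items()}
    if target is None:
        target = median(means.values())  # assumption: 'median mean' as target
    return {name: [v * target / means[name] for v in values]
            for name, values in samples.items()}


def quantile_normalize(samples):
    """Give every sample the same distribution: the mean of the sorted values per rank."""
    names = list(samples)
    n = len(samples[names[0]])
    sorted_values = [sorted(samples[name]) for name in names]
    target = [mean(col[i] for col in sorted_values) for i in range(n)]
    result = {}
    for name in names:
        order = sorted(range(n), key=lambda i: samples[name][i])
        normalized = [0.0] * n
        for rank, index in enumerate(order):
            normalized[index] = target[rank]
        result[name] = normalized
    return result


def normalize_by_totals(values, factor=1_000_000):
    """Divide each value by the sample sum and multiply by the factor."""
    total = sum(values)
    return [v / total * factor for v in values]


samples = {"s1": [2.0, 4.0, 6.0], "s2": [1.0, 5.0, 3.0]}  # hypothetical values
print(quantile_normalize(samples))
print(normalize_by_totals([1.0, 3.0], factor=100))  # [25.0, 75.0]
```

After quantile normalization both samples contain exactly the same set of values (1.5, 3.5 and 5.5 here), arranged according to the original ranks within each sample.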
Figures 25.22 and 25.23 show the effect on the distribution of expression values when using
scaling or quantile normalization, respectively.
At the bottom of the dialog in figure 25.21, you can select which values to normalize (see
section 25.2.1).
Clicking Next will display a dialog as shown in figure 25.24.
The following parameters can be set:
• Normalization value. The summary value of each sample that you want to be equal across
the samples for the normalized expression values:
Mean.
Median.
• Reference. The specific value that you want the normalized value to be after normalization.
Median mean.
Median median.
Use another sample.
• Trimming percentage. Expression values that lie below the value of this percentile, or
above 100 minus the value of this percentile, in the empirical distribution of the expression
values in a sample will be excluded when calculating the normalization and reference
values.
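The effect of the trimming percentage can be illustrated with a small Python sketch. It is an approximation under a stated assumption: instead of computing exact percentiles, it drops the same fraction of values from each end of the sorted sample. The function name and the example values are hypothetical.

```python
from statistics import mean


def trimmed(values, trimming_percentage):
    """Drop the lowest and highest trimming_percentage percent of the values."""
    ordered = sorted(values)
    # Assumption: the number of values dropped at each end is rounded down.
    k = int(len(ordered) * trimming_percentage / 100)
    return ordered[k:len(ordered) - k] if k else ordered


# Hypothetical sample with one extreme value.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]

print(mean(values))               # 104.5
print(mean(trimmed(values, 10)))  # 5.5
```

Trimming keeps a few extreme expression values from dominating the mean that is used as the normalization value.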
samples, and may be used to spot unwanted systematic differences between samples, outlying
samples and samples of poor quality, that you may want to exclude.
Here you select which values to use in the box plot (see section 25.2.1).
Click Finish to start the tool.
preferences, you can adjust the general properties of the box plot (see figure 25.27).
• Lock axes This will always show the axes even though the plot is zoomed to a detailed
level.
• Frame Shows a frame around the graph.
• Show legends Shows the data legends.
• Tick type Determine whether tick lines should be shown outside or inside the frame.
• Tick lines at Choosing Major ticks will show a grid behind the graph.
• Vertical axis range Sets the range of the vertical axis (y axis). Enter a value in Min and
Max, and press Enter. This will update the view. If you wait a few seconds without pressing
Enter, the view will also be updated.
• Draw median line. This is the default - the median is drawn as a line in the box.
• Draw mean line. Alternatively, you can also display the mean value as a line.
• Show outliers. The values outside the whiskers range are called outliers. Per default they
are not shown. Note that the dot type that can be set below only takes effect when outliers
are shown. When you select and deselect the Show outliers, the vertical axis range is
automatically re-calculated to accommodate the new values.
Below the general preferences, you find the Lines and dots preferences, where you can adjust
coloring and appearance (see figure 25.28).
• Select sample or group. When you wish to adjust the properties below, first select an item
in this drop-down menu. That will apply the changes below to this item. If your plot is based
on an experiment, the drop-down menu includes both group names and sample names, as
well as an entry for selecting "All". If your plot is based on single elements, only sample
names will be visible. Note that there are sometimes "mixed states" when you select a
group where two of the samples e.g. have different colors. Selecting a new color in this
case will erase the differences.
• Dot type Can be None, Cross, Plus, Square, Diamond, Circle, Triangle, Reverse triangle, or
Dot.
Note that if you wish to use the same settings next time you open a box plot, you need to save
the settings of the Side Panel (see section 4.6).
Figure 25.29: Box plot for an experiment with 5 groups and 27 samples.
4. iterating 2-3 until there is only one cluster left (which will contain all samples).
The tree is drawn so that the distances between clusters are reflected by the lengths of the
branches in the tree. Thus, features with expression profiles that closely resemble each other
have short distances between them, those that are more different, are placed further apart.
(See [Eisen et al., 1998] for a classical example of application of a hierarchical clustering
algorithm in microarray analysis. The example is on features rather than samples).
To start the clustering:
Tools | Expression Analysis | Quality Control | Hierarchical Clustering of Samples
Select a number of samples or an experiment and click Next.
This will display a dialog as shown in figure 25.32. The hierarchical clustering algorithm requires
that you specify a distance measure and a cluster linkage. The distance measure is used to
specify how distances between two samples should be calculated. The cluster linkage
specifies how you want the distance between two clusters, each consisting of a number of
samples, to be calculated.
• Euclidean distance. The length of the segment connecting two points. If $u = (u_1, u_2, \ldots, u_n)$
and $v = (v_1, v_2, \ldots, v_n)$, then the Euclidean distance between u and v is
$$|u - v| = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}.$$
• Manhattan distance. The distance between two points measured along axes at right
angles. If $u = (u_1, u_2, \ldots, u_n)$ and $v = (v_1, v_2, \ldots, v_n)$, then the Manhattan distance
between u and v is
$$|u - v| = \sum_{i=1}^{n} |u_i - v_i|.$$
where $\bar{x}$/$\bar{y}$ and $s_x$/$s_y$ are the average and sample standard deviation, respectively, of the
values in x/y.
The Pearson correlation coefficient ranges from -1 to 1, with high absolute values indicating
strong correlation, and values near 0 suggesting little to no relationship between the
elements.
Using 1 - | Pearson correlation | as the distance measure ensures that highly correlated
elements have a shorter distance, while elements with low correlation are farther apart.
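The distance measures described above can be written out as a short Python sketch. The function names are hypothetical, and the Pearson correlation is computed with the standard sample formula (products of standardized values, summed and divided by n - 1), which is assumed here to match the definition used by the tool.

```python
import math
from statistics import mean, stdev


def euclidean(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pearson(x, y):
    """Sample Pearson correlation coefficient of x and y."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    return sum(((a - mean_x) / s_x) * ((b - mean_y) / s_y)
               for a, b in zip(x, y)) / (n - 1)


def pearson_distance(x, y):
    """1 - |Pearson correlation|: strongly correlated profiles are close."""
    return 1 - abs(pearson(x, y))


print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(manhattan([0.0, 0.0], [3.0, 4.0]))  # 7.0
print(pearson_distance([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))  # about 0
```

Note that because the absolute value is taken, two perfectly anti-correlated profiles, as in the last line, are treated as being as close as two perfectly correlated ones.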
The distance between two clusters is determined using one of the following linkage types:
• Single linkage. The distance between the two closest elements in the two clusters.
• Average linkage. The average distance between elements in the first cluster and elements
in the second cluster.
• Complete linkage. The distance between the two farthest elements in the two clusters.
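The three linkage types differ only in how the pairwise distances between the members of two clusters are summarized. A minimal Python sketch, with hypothetical function names and one-dimensional example points, is:

```python
from statistics import mean


def pairwise_distances(cluster_a, cluster_b, distance):
    """All distances between an element of cluster_a and an element of cluster_b."""
    return [distance(a, b) for a in cluster_a for b in cluster_b]


def single_linkage(cluster_a, cluster_b, distance):
    """Distance between the two closest elements."""
    return min(pairwise_distances(cluster_a, cluster_b, distance))


def average_linkage(cluster_a, cluster_b, distance):
    """Average distance over all pairs of elements."""
    return mean(pairwise_distances(cluster_a, cluster_b, distance))


def complete_linkage(cluster_a, cluster_b, distance):
    """Distance between the two farthest elements."""
    return max(pairwise_distances(cluster_a, cluster_b, distance))


def absolute_difference(a, b):
    """Distance between two one-dimensional points (for illustration only)."""
    return abs(a - b)


cluster_a = [0.0, 1.0]  # hypothetical cluster of two points
cluster_b = [3.0, 5.0]  # hypothetical cluster of two points

print(single_linkage(cluster_a, cluster_b, absolute_difference))    # 2.0
print(average_linkage(cluster_a, cluster_b, absolute_difference))   # 3.5
print(complete_linkage(cluster_a, cluster_b, absolute_difference))  # 5.0
```

Any of the distance measures listed earlier can be passed in place of `absolute_difference`, since the linkage only operates on the resulting pairwise distances.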
At the bottom, you can select which values to cluster (see section 25.2.1).
Click Finish to start the tool.
Note: To be run on a server, the tool has to be included in a workflow, and the results will be
displayed in a stand-alone new heat map rather than added into the input experiment table.
If you have used an experiment as input and run the non-workflow version of the tool, the clustering
is added to the experiment and will be saved when you save the experiment. It can be viewed by
clicking the Show Heat Map button at the bottom of the view (see figure 25.34).
If you have run the workflow version of the tool, or selected a number of samples
as input, a new element will be created that has to be saved separately.
Regardless of the input, the view of the clustering is the same. As you can see in figure 25.33,
there is a tree at the bottom of the view to visualize the clustering. The names of the samples
are listed at the top. The features are represented as horizontal lines, colored according to the
expression level. If you place the mouse on one of the lines, you will see the name of the
feature to the left. The features are sorted by their expression level in the first sample (to
cluster the features, see section 25.4.1).
Researchers often have a priori knowledge of which samples in a study should be similar (e.g.
samples from the same experimental condition) and which should be different (samples from
biologically distinct conditions). Thus, researchers have expectations about how they should cluster.
Samples that are placed unexpectedly in the hierarchical clustering tree may be samples that
have been wrongly allocated to a group, samples of unintended or unclean tissue composition
or samples for which the processing has gone wrong. Unexpectedly placed samples, of course,
could also be highly interesting samples.
There are a number of options to change the appearance of the heat map. At the top of the Side
Panel, you find the Heat map preference group (see figure 25.35).
At the top, there is information about the heat map currently displayed. The information regards
type of clustering, expression value used together with distance and linkage information. If you
have performed more than one clustering, you can choose between the resulting heat maps in a
drop-down box (see figure 25.36).
Note that if you perform an identical clustering, the existing heat map will simply be replaced.
Below this box, there are a number of settings for displaying the heat map.
Figure 25.36: When more than one clustering has been performed, there will be a list of heat maps
to choose from.
• Lock width to window. When you zoom in the heat map, you will per default only zoom in
on the vertical level. This is because the width of the heat map is locked to the window.
If you uncheck this option, you will zoom both vertically and horizontally. Since you always
have more features than samples, it is useful to lock the width since you then have all the
samples in view all the time.
• Lock height to window. This is the corresponding option for the height. Note that if you
check both options, you will not be able to zoom at all, since both the width and the height
are fixed.
• Lock headers and footers. This will ensure that you are always able to see the sample and
feature names and the trees when you zoom in.
• Colors. The expression levels are visualized using a gradient color scheme, where the
right side color is used for high expression levels and the left side color is used for low
expression levels. You can change the coloring by clicking the box, and you can change the
relative coloring of the values by dragging the two knobs on the white slider above.
Below you find the Samples and Features groups. They contain options to show names, legend,
and tree above or below the heat map. Note that for clustering of samples, you find the tree
options in the Samples group, and for clustering of features, you find the tree options in the
Features group. With the tree options, you can also control the Tree size, from tiny to very large,
and the option of showing the full tree, no matter how much space it will use.
For clustering of features, the Features group has an option to "Optimize tree layout". This
attempts to reorder the features, consistently with the tree, such that the most expressed
features form a diagonal from the top-left to the bottom-right of the heat map.
The Samples group contains an "Order by:" dropdown that allows re-ordering of the columns of
the heat map. When clustering by samples it is possible to choose between using the "Tree" to
determine the sample ordering, and showing the "Samples" in the order they were input to the
tool. When clustering by features, only the "Samples" input order is available.
Note that if you wish to use the same settings next time you open a heat map, you need to save
the settings of the Side Panel (see section 4.6).
Figure 25.37: Selecting which values the principal component analysis should be based on.
In this dialog, you select the values to be used for the principal component analysis (see
section 25.2.1).
Click Finish to start the tool.
matrix rather than the covariance matrix by choosing 'Correlation scatter plot'. Both plots will
show how the samples separate along the two directions between which the samples exhibit the
largest amount of variation. For the 'projection scatter plot' this variation is measured in absolute
terms, and depends on the units in which you have measured your samples. The correlation
scatter plot is a normalized version of the projection scatter plot, which makes it possible to
compare principal component analysis between experiments, even when these have not been
done using the same units (e.g. an experiment that uses 'original' scale data and another one
that uses 'log-scale' data).
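The difference between an analysis based on the covariance matrix and one based on the correlation matrix can be illustrated with two variables in Python. This sketch only demonstrates the dependence on units for a simple change of unit (multiplying by a constant). It does not perform a principal component analysis, and the function names and values are hypothetical.

```python
from statistics import mean, stdev


def covariance(x, y):
    """Sample covariance of x and y."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Sample correlation: the covariance of the standardized values."""
    return covariance(x, y) / (stdev(x) * stdev(y))


x = [1.0, 2.0, 3.0, 4.0]             # hypothetical measurements
y = [2.0, 1.0, 4.0, 3.0]             # hypothetical measurements
y_rescaled = [v * 1000 for v in y]   # the same data in a 1000 times smaller unit

print(covariance(x, y), covariance(x, y_rescaled))    # about 1 and about 1000
print(correlation(x, y), correlation(x, y_rescaled))  # about 0.6 in both cases
```

The covariance changes with the unit of measurement while the correlation does not, which is why the correlation-based plot is the one suited to comparisons between experiments.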
The plot in figure 25.38 is based on a two-group experiment. The group relationships are indicated
by color. We expect the samples from within a group to exhibit less variability when compared,
than samples from different groups. Thus samples should cluster according to groups and this is
what we see. The PCA plot is thus helpful in identifying outlying samples and samples that have
been wrongly assigned to a group.
In the Side Panel to the left, there are a number of options to adjust the view. Under Graph
preferences, you can adjust the general properties of the plot.
• Lock axes This will always show the axes even though the plot is zoomed to a detailed
level.
• Tick type Determine whether tick lines should be shown outside or inside the frame.
• Tick lines at Choosing Major ticks will show a grid behind the graph.
• Horizontal axis range Sets the range of the horizontal axis (x axis). Enter a value in Min
and Max, and press Enter. This will update the view. If you wait a few seconds without
pressing Enter, the view will also be updated.
• Vertical axis range Sets the range of the vertical axis (y axis). Enter a value in Min and
Max, and press Enter. This will update the view. If you wait a few seconds without pressing
Enter, the view will also be updated.
• y = 0 axis. Draws a line where y = 0. Below there are some options to control the
appearance of the line:
• Select sample or group. When you wish to adjust the properties below, first select an item
in this drop-down menu. That will apply the changes below to this item. If your plot is based
on an experiment, the drop-down menu includes both group names and sample names, as
well as an entry for selecting "All". If your plot is based on single elements, only sample
names will be visible. Note that there are sometimes "mixed states" when you select a
group where two of the samples e.g. have different colors. Selecting a new color in this
case will erase the differences.
• Dot type Can be None, Cross, Plus, Square, Diamond, Circle, Triangle, Reverse triangle, or
Dot.
• Show name. This will show a label with the name of the sample next to the dot. Note that
the labels quickly get crowded, which is why the names are not shown by default.
Note that if you wish to use the same settings next time you open a principal component plot,
you need to save the settings of the Side Panel (see section 4.6).
Scree plot
Besides the view shown in figure 25.38, the result of the principal component analysis can also be
viewed as a scree plot by clicking the Show Scree Plot button at the bottom of the view. The scree
plot shows the proportion of variation in the data explained by each of the principal components.
The first principal component accounts for the largest part of the variability.
In the Side Panel to the left, there are a number of options to adjust the view. Under Graph
preferences, you can adjust the general properties of the plot.
• Lock axes This will always show the axes even though the plot is zoomed to a detailed
level.
• Tick type Determine whether tick lines should be shown outside or inside the frame.
• Tick lines at Choosing Major ticks will show a grid behind the graph.
• Horizontal axis range Sets the range of the horizontal axis (x axis). Enter a value in Min
and Max, and press Enter. This will update the view. If you wait a few seconds without
pressing Enter, the view will also be updated.
• Vertical axis range Sets the range of the vertical axis (y axis). Enter a value in Min and
Max, and press Enter. This will update the view. If you wait a few seconds without pressing
Enter, the view will also be updated.
• Dot type Can be None, Cross, Plus, Square, Diamond, Circle, Triangle, Reverse triangle, or
Dot.
Note that the graph title and the axes titles can be edited simply by clicking them with the mouse.
These changes will be saved when you Save the graph, whereas the changes in the Side
Panel need to be saved explicitly (see section 4.6).
4. iterating 2-3 until there is only one cluster left (which will contain all samples).
The tree is drawn so that the distances between clusters are reflected by the lengths of the
branches in the tree. Thus, features with expression profiles that closely resemble each other
have short distances between them, those that are more different, are placed further apart.
To start the clustering of features:
Tools | Expression Analysis | Feature Clustering | Hierarchical Clustering
of Features
• Euclidean distance. The length of the segment connecting two points. If u = (u1, u2, ..., un)
and v = (v1, v2, ..., vn), then the Euclidean distance between u and v is
$$|u - v| = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}.$$
• Manhattan distance. The distance between two points measured along axes at right
angles. If u = (u1, u2, ..., un) and v = (v1, v2, ..., vn), then the Manhattan distance
between u and v is
$$|u - v| = \sum_{i=1}^{n} |u_i - v_i|.$$
where $\bar{x}$/$\bar{y}$ and $s_x$/$s_y$ are the average and sample standard deviation,
respectively, of the values in x/y.
The Pearson correlation coefficient ranges from -1 to 1, with high absolute values indicating
strong correlation, and values near 0 suggesting little to no relationship between the
elements.
Using 1 - | Pearson correlation | as the distance measure ensures that highly correlated
elements have a shorter distance, while elements with low correlation are farther apart.
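The three distance measures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the code is not taken from CLC Main Workbench.

```python
import math


def euclidean_distance(u, v):
    """Length of the straight segment connecting u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Distance measured along axes at right angles."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pearson_distance(x, y):
    """1 - |Pearson correlation| of two expression profiles.

    The (n - 1) factors of the sample covariance and the sample standard
    deviations cancel, so plain sums are used. A constant profile has zero
    variance and would raise ZeroDivisionError (not handled in this sketch).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    r = cov / math.sqrt(var_x * var_y)
    return 1.0 - abs(r)
```

Note that perfectly correlated and perfectly anti-correlated profiles both get a Pearson-based distance of 0, because the absolute value of the correlation is used.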
The distance between two clusters is determined using one of the following linkage types:
• Single linkage. The distance between the two closest elements in the two clusters.
• Average linkage. The average distance between elements in the first cluster and elements
in the second cluster.
• Complete linkage. The distance between the two farthest elements in the two clusters.
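As a rough sketch of how the three linkage types differ, the following Python function computes the distance between two clusters from all pairwise element distances. The function names and the use of Manhattan distance in the example are assumptions made for illustration only.

```python
from itertools import product


def manhattan_distance(u, v):
    """Element-to-element distance used in this example."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cluster_distance(cluster_a, cluster_b, dist=manhattan_distance, linkage="average"):
    """Distance between two clusters under single, average or complete linkage."""
    pairwise = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if linkage == "single":
        return min(pairwise)  # two closest elements
    if linkage == "complete":
        return max(pairwise)  # two farthest elements
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage type: {linkage}")
```

For clusters {(0), (1)} and {(3), (5)} the pairwise distances are 3, 5, 2 and 4, so single linkage gives 2, complete linkage gives 5 and average linkage gives 3.5.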
At the bottom, you can select which values to cluster (see section 25.2.1).
Click Finish to start the tool.
If you have used an experiment as input, the clustering is added to the experiment and will
be saved when you save the experiment. It can be viewed by clicking the Show Heat Map
button at the bottom of the view (see figure 25.41).
If you have selected a number of samples as input, a new element will be created
that has to be saved separately.
Regardless of the input, a hierarchical tree view with associated heatmap is produced (figure
25.40). In the heatmap each row corresponds to a feature and each column to a sample. The
color in the i'th row and j'th column reflects the expression level of feature i in sample j (the
color scale can be set in the side panel). The order of the rows in the heatmap is determined by
the hierarchical clustering. If you place the mouse on one of the rows, you will see the name of
the corresponding feature to the left. The order of the columns (that is, samples) is determined
by their input order or (if defined) experimental grouping. The names of the samples are listed at
the top of the heatmap and the samples are organized into groups.
There are a number of options to change the appearance of the heat map. At the top of the Side
Panel, you find the Heat map preference group (see figure 25.42).
At the top, there is information about the heat map currently displayed. The information regards
type of clustering, expression value used together with distance and linkage information. If you
have performed more than one clustering, you can choose between the resulting heat maps in a
drop-down box (see figure 25.43).
Note that if you perform an identical clustering, the existing heat map will simply be replaced.
Below this box, there are a number of settings for displaying the heat map.
• Lock width to window. When you zoom in the heat map, you will by default only zoom in
on the vertical level. This is because the width of the heat map is locked to the window.
If you uncheck this option, you will zoom both vertically and horizontally. Since you always
have more features than samples, it is useful to lock the width since you then have all the
samples in view all the time.
• Lock height to window. This is the corresponding option for the height. Note that if you
check both options, you will not be able to zoom at all, since both the width and the height
are fixed.
• Lock headers and footers. This will ensure that you are always able to see the sample and
Figure 25.43: When more than one clustering has been performed, there will be a list of heat maps
to choose from.
• Colors. The expression levels are visualized using a gradient color scheme, where the
right side color is used for high expression levels and the left side color is used for low
expression levels. You can change the coloring by clicking the box, and you can change the
relative coloring of the values by dragging the two knobs on the white slider above.
Below you find the Samples and Features groups. They contain options to show names, legend,
and tree above or below the heat map. Note that for clustering of samples, you find the tree
options in the Samples group, and for clustering of features, you find the tree options in the
Features group. With the tree options, you can also control the Tree size, from tiny to very large,
and the option of showing the full tree, no matter how much space it will use.
For clustering of features, the Features group has an option to "Optimize tree layout". This
attempts to reorder the features, consistently with the tree, such that the most expressed
features form a diagonal from the top-left to the bottom-right of the heat map.
The Samples group contains an "Order by:" dropdown that allows re-ordering of the columns of
the heat map. When clustering by samples it is possible to choose between using the "Tree" to
determine the sample ordering, and showing the "Samples" in the order they were input to the
tool. When clustering by features, only the "Samples" input order is available.
Note that if you wish to use the same settings next time you open a heat map, you need to save
the settings of the Side Panel (see section 4.6).
K-means. K-means clustering assigns each point to the cluster whose center is
nearest. The center/centroid of a cluster is defined as the average of all points
in the cluster. If a data set has three dimensions and the cluster has two points
X = (x1 , x2 , x3 ) and Y = (y1 , y2 , y3 ), then the centroid Z becomes Z = (z1 , z2 , z3 ),
where zi = (xi + yi )/2 for i = 1, 2, 3. The algorithm attempts to minimize the
intra-cluster variance defined by:
$$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2$$
Manhattan distance. The Manhattan distance between two elements is the distance
measured along axes at right angles. If u = (u1 , u2 , . . . , un ) and v = (v1 , v2 , . . . , vn ),
then the Manhattan distance between u and v is
$$|u - v| = \sum_{i=1}^{n} |u_i - v_i|.$$
• Subtract mean value. For each gene, subtract the mean gene expression value over all
input samples.
At the top, you can choose the Level to use. Choosing 'sample values' means that distances will
be calculated using all the individual values of the samples. When 'group means' are chosen,
distances are calculated using the group means.
At the bottom, you can select which values to cluster (see section 25.2.1).
Click Finish to start the tool.
The k-means implementation first assigns each feature to a cluster at random. Then, at each
iteration, it reassigns features to the centroid of the nearest cluster. During this reassignment, it
can happen that one or more of the clusters becomes empty, explaining why the final number of
clusters might be smaller than the one specified in "number of partitions". Note that the initial
assignment of features to clusters is random, so results can differ when the algorithm is run
again.
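The procedure described above (random initial assignment, then repeated reassignment to the nearest centroid) can be sketched as follows. This is a minimal illustration, not the Workbench implementation: the function names, the use of squared Euclidean distance, the stopping rule and the iteration limit are all assumptions.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def compute_centroids(points, labels):
    """Average of all points in each non-empty cluster."""
    centroids = {}
    for cluster in sorted(set(labels)):
        members = [p for p, label in zip(points, labels) if label == cluster]
        centroids[cluster] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids


def intra_cluster_variance(points, labels, centroids):
    """Sum of squared distances of each point to its cluster centroid (V)."""
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


def kmeans(points, k, max_iterations=100, seed=None):
    rng = random.Random(seed)
    # Initial assignment is random, so results can differ between runs.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iterations):
        # Clusters that have become empty simply drop out here, which is why
        # fewer than k clusters may remain at the end.
        centroids = compute_centroids(points, labels)
        new_labels = [
            min(centroids, key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assumed stopping rule: no point changed cluster
            break
        labels = new_labels
    return labels, compute_centroids(points, labels)
```

Passing a fixed `seed` makes a run repeatable; without it, two runs on the same data may produce different partitions, as noted above.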
The samples used are from a time-series experiment, and you can see that the expression levels
for each cluster have a distinct pattern. The two clusters at the bottom have falling and rising
expression levels, respectively, and the two clusters at the top both fall at the beginning but then
rise again (the one to the right starts to rise earlier than the other one).
Having inspected the graphs, you may wish to take a closer look at the features represented in
each cluster. In the experiment table, the clustering has added an extra column with the name
of the cluster that the feature belongs to. In this way you can filter the table to see only features
from a specific cluster. This also means that you can select the features of this cluster in a
volcano or scatter plot as described in section 25.1.5.
specify which of the groups you want to use as reference (the default is to use the group you
specified as Group 1 when you set up the experiment).
Note that the proportion-based tests use the total sample counts (that is, the sum over all
expression values). If one (or more) of the counts are NaN, the sum will be NaN and all the
test statistics will be NaN. As a consequence all p-values will also be NaN. You can avoid this
by filtering your experiment and creating a new experiment so that no NaN values are present,
before you apply the tests.
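The way a single NaN propagates into the total sample count, and the effect of filtering first, can be illustrated with a short Python sketch. The data layout (a dictionary mapping hypothetical feature names to one value per sample) is an assumption made for this example.

```python
import math


def sample_totals(features):
    """Total count per sample: the sum over all feature values of that sample."""
    return [sum(column) for column in zip(*features.values())]


def features_without_nan(features):
    """Keep only features that have no NaN value in any sample."""
    return {
        name: values
        for name, values in features.items()
        if not any(math.isnan(v) for v in values)
    }


# Hypothetical data: three features measured in two samples.
features = {
    "feature_1": [1.0, 2.0],
    "feature_2": [float("nan"), 3.0],
    "feature_3": [4.0, 5.0],
}
totals_before = sample_totals(features)                        # first total is NaN
totals_after = sample_totals(features_without_nan(features))   # no NaN left
```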
T-tests
For experiments with two groups you can, among the Gaussian tests, only choose a T-test as
shown in figure 25.47.
There are different types of t-tests, depending on the assumption you make about the variances
in the groups. By selecting 'Homogeneous' (the default) calculations are done assuming that the
groups have equal variances. When 'In-homogeneous' is selected, this assumption is not made.
The t-test can also be chosen if you have a multi-group experiment. In this case you may choose
either to have t-tests produced for all pairs of groups (by clicking the 'All pairs' button) or to
have a t-test produced for each group compared to a specified reference group (by clicking the
'Against reference' button). In the last case you must specify which of the groups you want to
use as reference (the default is to use the group you specified as Group 1 when you set up the
experiment).
If an experiment with pairing was set up (see section 25.1.1) the Use pairing tick box is active. If
ticked, paired t-tests will be calculated, if not, the formula for the standard t-test will be used.
When a t-test is run on an experiment four columns will be added to the experiment table for
each pair of groups that are analyzed. The 'Difference' column contains the difference between
the mean of the expression values across the samples assigned to group 2 and the mean of
the expression values across the samples assigned to group 1. The 'Fold Change' column tells
you how many times bigger the mean expression value in group 2 is relative to that of group 1.
If the mean expression value in group 2 is bigger than that in group 1 this value is the mean
expression value in group 2 divided by that in group 1. If the mean expression value in group 2
is smaller than that in group 1 the fold change is the mean expression value in group 1 divided
by that in group 2 with a negative sign. The 'Test statistic' column holds the value of the test
statistic, and the 'P-value' holds the two-sided p-value for the test. Up to two more columns may
be added if the options to calculate Bonferroni and FDR corrected p-values were chosen.
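The 'Difference' and 'Fold Change' columns, together with an unpaired t statistic, can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, the sign convention of the statistic (group 2 minus group 1) and the handling of equal means are assumed, and the p-value is left out because it requires the t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def t_test_columns(group1, group2, homogeneous=True):
    """Difference, signed fold change and t statistic for two groups of values."""
    m1, m2 = mean(group1), mean(group2)
    n1, n2 = len(group1), len(group2)
    v1, v2 = variance(group1), variance(group2)  # sample variances

    difference = m2 - m1
    # Signed fold change as described above; a zero mean would raise
    # ZeroDivisionError. Equal means give 1.0 (an assumption of this sketch).
    fold_change = m2 / m1 if m2 >= m1 else -(m1 / m2)

    if homogeneous:
        # Equal variances assumed: pooled variance estimate.
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    else:
        # No equal-variance assumption (Welch form).
        standard_error = math.sqrt(v1 / n1 + v2 / n2)

    return {
        "Difference": difference,
        "Fold Change": fold_change,
        "Test statistic": difference / standard_error,
    }
```

For group 1 = (1, 2, 3) and group 2 = (4, 5, 6) the difference is 3 and the fold change is 2.5; swapping the groups gives a difference of -3 and a fold change of -2.5.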
ANOVA
For experiments with more than two groups you can choose T-test, see section 25.5.2, or
ANOVA.
The ANOVA method allows analysis of an experiment with one factor and a number of groups,
e.g. different types of tissues, or time points. In the analysis, the variance within groups is
compared to the variance between groups. You get a significant result (that is, a small ANOVA
p-value) if the difference you see between groups relative to that within groups, is larger than
what you would expect, if the data were really drawn from groups with equal means.
If an experiment with pairing was set up (see section 25.1.1) the Use pairing tick box is active.
If ticked, a repeated measures one-way ANOVA test will be calculated, if not, the formula for the
standard one-way ANOVA will be used.
When an ANOVA analysis is run on an experiment four columns will be added to the experiment
table for each pair of groups that are analyzed. The 'Max difference' column contains the
difference between the maximum and minimum of the mean expression values of the groups,
multiplied by -1 if the group with the maximum mean expression value occurs before the group
with the minimum mean expression value (with the ordering: group 1, group 2, ...). The 'Max fold
change' column contains the ratio of the maximum of the mean expression values of the groups
to the minimum of the mean expression values of the groups, multiplied by -1 if the group with the
maximum mean expression value occurs before the group with the minimum mean expression
value (with the ordering: group 1, group 2, ...). The 'Test statistic' column holds the value of the
test statistic, and the 'P-value' holds the two-sided p-value for the test. Up to two more columns
may be added if the options to calculate Bonferroni and FDR corrected p-values were chosen.
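The 'Max difference' and 'Max fold change' columns, including the sign rule based on group order, can be sketched as follows. The function name is hypothetical, and the F statistic shown is the textbook one-way ANOVA statistic for unpaired data, included as an assumption about what the 'Test statistic' column represents.

```python
from statistics import mean


def anova_columns(groups):
    """Max difference, max fold change and one-way F statistic.

    `groups` is a list of value lists in the order group 1, group 2, ...
    """
    means = [mean(g) for g in groups]
    i_max = means.index(max(means))
    i_min = means.index(min(means))
    # Multiply by -1 if the group with the maximum mean occurs before
    # the group with the minimum mean.
    sign = -1 if i_max < i_min else 1

    max_difference = sign * (means[i_max] - means[i_min])
    max_fold_change = sign * (means[i_max] / means[i_min])  # zero mean not handled

    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    f_statistic = (ss_between / (k - 1)) / (ss_within / (n_total - k))

    return {
        "Max difference": max_difference,
        "Max fold change": max_fold_change,
        "Test statistic": f_statistic,
    }
```

With groups (1, 2, 3), (4, 5, 6) and (7, 8, 9) the maximum mean occurs last, so the values are 6 and 4; listing the same groups in reverse order gives -6 and -4.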
At the top, you can select which values to analyze (see section 25.2.1).
Below you can select to add two kinds of corrected p-values to the analysis (in addition to the
standard p-value produced for the test statistic):
• Bonferroni corrected.
• FDR corrected.
Both are calculated from the original p-values, and aim in different ways to take into account the
issue of multiple testing [Dudoit et al., 2003]. The problem of multiple testing arises because
the original p-values are related to a single test: the p-value is the probability of observing a more
extreme value than that observed in the test carried out. If the p-value is 0.04, we would expect
a value as extreme as that observed in 4 out of 100 tests carried out among groups with no
difference in means. Popularly speaking, if we carry out 10000 tests and select the features with
original p-values below 0.05, we will expect about 0.05 times 10000 = 500 to be false positives.
The Bonferroni corrected p-values handle the multiple testing problem by controlling the 'family-
wise error rate': the probability of making at least one false positive call. They are calculated by
multiplying the original p-values by the number of tests performed. The probability of having at
least one false positive among the set of features with Bonferroni corrected p-values below 0.05,
is less than 5%. The Bonferroni correction is conservative: there may be many genes that are
differentially expressed among the genes with Bonferroni corrected p-values above 0.05, that will
be missed if this correction is applied.
Instead of controlling the family-wise error rate we can control the false discovery rate: FDR. The
false discovery rate is the proportion of false positives among all those declared positive. We
expect 5% of the features with FDR corrected p-values below 0.05 to be false positive. There
are many methods for controlling the FDR - the method used in CLC Main Workbench is that
of [Benjamini and Hochberg, 1995].
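Both corrections can be sketched in a few lines of Python. The function names are hypothetical; capping the Bonferroni values at 1 and reporting Benjamini-Hochberg results as monotone adjusted p-values are common conventions assumed here, not details confirmed by this manual.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests (capped at 1, an assumption)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR-adjusted p-values in the style of Benjamini and Hochberg (1995)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.01, 0.04, 0.03, 0.005]
bonferroni_p = bonferroni(p)       # approximately [0.04, 0.16, 0.12, 0.02]
fdr_p = benjamini_hochberg(p)      # approximately [0.02, 0.04, 0.04, 0.02]
```

In this small example only two features pass a 0.05 cut-off after Bonferroni correction, while all four pass after FDR correction, which illustrates why the Bonferroni correction is described as conservative.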
Click Finish to start the tool.
Note that if you have already performed statistical analysis on the same values, the existing one
will be overwritten.
The volcano plot shows the relationship between the p-values of a statistical test and the
magnitude of the difference in expression values of the samples in the groups. On the y-axis
the − log10 p-values are plotted. For the x-axis you may choose between two sets of values by
choosing either 'Fold change' or 'Difference' in the volcano plot side panel's 'Values' part. If
you choose 'Fold change' the log of the values in the 'fold change' (or 'Weighted fold change')
column for the test will be displayed. If you choose 'Difference' the values in the 'Difference' (or
'Weighted difference') column will be used. Which values you wish to display will depend upon
the scale of your data (read the note on fold change in section 25.1.2).
The larger the difference in expression of a feature, the more extreme its point will lie on
the X-axis. The more significant the difference, the smaller the p-value and thus the higher
the − log10 (p) value. Thus, points for features with highly significant differences will lie high
in the plot. Features of interest are typically those which change significantly and by a certain
magnitude. These are the points in the upper left and upper right hand parts of the volcano plot.
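The position of a feature in the plot can be sketched as follows, using the 'Difference' option for the x-axis. The function names and the two thresholds are hypothetical and only illustrate the idea of selecting points in the upper left and upper right parts of the plot.

```python
import math


def volcano_point(difference, p_value):
    """(x, y) position of a feature: difference on x, -log10(p) on y."""
    return difference, -math.log10(p_value)


def is_of_interest(difference, p_value, min_abs_difference, max_p_value):
    """True for features that change significantly and by a certain magnitude."""
    return abs(difference) >= min_abs_difference and p_value <= max_p_value
```

For example, a feature with a difference of 2.0 and a p-value of 0.001 lies at x = 2.0 and at a height of about 3 on the y-axis.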
If you have performed different tests or you have an experiment with multiple groups you need to
specify for which test and which group comparison you want the volcano plot to be shown. You
do this in the 'Test' and 'Values' parts of the volcano plot side panel.
Options for the volcano plot are described in further detail when describing the Side Panel below.
If you place your mouse on one of the dots, a small text box will tell the name of the feature.
Note that you can zoom in and out on the plot (see section 2.2).
In the Side Panel to the right, there are a number of options to adjust the view of the volcano plot.
Under Graph preferences, you can adjust the general properties of the volcano plot.
• Lock axes This will always show the axes even though the plot is zoomed to a detailed
level.
• Frame Shows a frame around the graph.
• Show legends Shows the data legends.
• Tick type Determine whether tick lines should be shown outside or inside the frame.
• Tick lines at Choosing Major ticks will show a grid behind the graph.
• Horizontal axis range Sets the range of the horizontal axis (x axis). Enter a value in Min
and Max, and press Enter. This will update the view. If you wait a few seconds without
pressing Enter, the view will also be updated.
• Vertical axis range Sets the range of the vertical axis (y axis). Enter a value in Min and
Max, and press Enter. This will update the view. If you wait a few seconds without pressing
Enter, the view will also be updated.
Below the general preferences, you find the Dot properties, where you can adjust coloring and
appearance of the dots.
• Dot type Can be None, Cross, Plus, Square, Diamond, Circle, Triangle, Reverse triangle, or
Dot.
• Dot color Click the color box to select a color.
At the very bottom, you find two groups for choosing which values to display:
• Test. In this group, you can select which kind of test you want the volcano plot to be shown
for.
• Values. Under Values, you can select which values to plot. If you have multi-group
experiments, you can select which groups to compare. You can also select whether to plot
Difference or Fold change on the x-axis. Read the note on fold change in section 25.1.2.
Note that if you wish to use the same settings next time you open a volcano plot, you need to save
the settings of the Side Panel (see section 4.6).
At the top, you select which annotation to use for testing. You can select from all the annotations
available on the experiment, but it is of course only a few that are biologically relevant. Once you
have selected an annotation, you will see the number of features carrying this annotation below.
Annotations are typically given at the gene level. Often a gene is represented by more than one
feature in an experiment. If this is not taken into account it may lead to a biased result. The
standard way to deal with this is to reduce the set of features considered, so that each gene is
represented only once. In the next step, Remove duplicates, you can choose the basis on which
the feature set will be reduced:
Highest IQR. The feature with the highest interquartile range (IQR) is kept.
Highest value. The feature with the highest expression value is kept.
First you specify which annotation you want to use as gene identifier. Once you have selected this,
you will see the number of features carrying this annotation below. Next you specify which feature
you want to keep for each gene. This may be either the feature with the highest inter-quartile
range or the highest value.
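Reducing the feature set so that each gene is represented once can be sketched as below. Several details are assumptions made for illustration: the function names, the data layout (feature name mapped to a gene identifier and a list of values), the quartile method of `statistics.quantiles`, and reading "highest value" as the largest single expression value of a feature.

```python
from statistics import quantiles


def interquartile_range(values):
    """IQR using the default (exclusive) method of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def remove_duplicates(features, basis="iqr"):
    """Keep one feature per gene.

    `features` maps a feature name to (gene identifier, list of values).
    Returns a dict mapping each gene identifier to the feature that is kept.
    """
    score = interquartile_range if basis == "iqr" else max
    kept = {}
    for name, (gene, values) in features.items():
        if gene not in kept or score(values) > score(features[kept[gene]][1]):
            kept[gene] = name
    return kept


# Hypothetical example: two features (probes) represent GENE1.
features = {
    "probe_a": ("GENE1", [8, 8, 8, 20]),
    "probe_b": ("GENE1", [1, 5, 9, 13]),
    "probe_c": ("GENE2", [2, 2, 2, 20]),
}
by_iqr = remove_duplicates(features, basis="iqr")      # GENE1 -> probe_b
by_value = remove_duplicates(features, basis="value")  # GENE1 -> probe_a
```

The example shows that the two options can keep different features for the same gene.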
At the bottom, you can select which values to analyze (see section 25.2.1). Only features that
have a numerical value assigned to them will be used for the analysis. That is, any feature which
has a value of plus infinity, minus infinity or NaN will not be included in the feature list taken into
the test. Thus, the choice of value at this step can affect the features that are taken forward into
the test in two ways:
• If there are features with values of plus infinity, minus infinity or NaN, those features will
not be taken forward into the test. This can be a consideration when choosing transformed
values, where the mathematical manipulations involved may lead to such values.
• If you chose to remove duplicates, then the value type you choose here is the value used
for checking the highest IQR or value to determine which feature is taken forward into the
test.
• Description. This is the description belonging to the category. Both of these are simply
extracted from the annotations.
• Full set. The number of features in the original experiment (not the subset) with this
category. (Note that this is after removal of duplicates).
• In subset. The number of features in the subset with this category. (Note that this is after
removal of duplicates).
• Expected in subset. The number of features we would have expected to find with this
annotation category in the subset, if the subset was a random draw from the full set.
• p-value. The tail probability of the hypergeometric distribution. This is the value used for
sorting the table.
Categories with small p-values are over-represented on the features in the subset relative to the
full set.
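The 'Expected in subset' and 'p-value' columns can be illustrated with a small Python sketch. The function names are hypothetical, and using the upper tail (at least as many category members in the subset as observed) is an assumption that matches testing for over-representation.

```python
from math import comb


def expected_in_subset(full_size, full_with_category, subset_size):
    """Expected count if the subset were a random draw from the full set."""
    return subset_size * full_with_category / full_size


def hypergeometric_upper_tail(full_size, full_with_category, subset_size, in_subset):
    """P(X >= in_subset) when drawing subset_size features without replacement."""
    total = comb(full_size, subset_size)
    upper = min(full_with_category, subset_size)
    favourable = sum(
        comb(full_with_category, k)
        * comb(full_size - full_with_category, subset_size - k)
        for k in range(in_subset, upper + 1)
    )
    return favourable / total
```

For a full set of 10 features of which 4 carry the category, and a subset of 5 features of which 3 carry it, the expected number in the subset is 2 and the upper tail probability is 66/252, roughly 0.26.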
GO terms are organized in a hierarchical structure. For example, the term "GO:0033151 V(D)J
recombination" from the Gene Ontology [Ashburner et al., 2000, The Gene Ontology Consortium,
2019] (https://geneontology.org/) is a descendant of "GO:0006259 DNA metabolic
process".
When testing for the significance of a particular GO term, all features linked to descendant GO
terms are included in the test. This can lead to a higher number of detected genes in the output
table, compared to the number of genes linked to the tested GO term.
Due to the hierarchical structure, GO terms are not independent of one another, and the p-values
provided in the table should be interpreted with caution.
somewhat arbitrary - using a larger or smaller p-value cut-off will result in including more or less.
Also, the magnitudes of differential expression of the genes are not considered.
The Gene Set Enrichment Analysis (GSEA) does NOT take a sublist of differentially expressed
genes and compare it to the full list - it takes a single gene list (a single experiment). The
idea behind GSEA is to consider a measure of association between the genes and phenotype
of interest (e.g. test statistic for differential expression) and rank the genes according to this
measure of association. A test is then carried out for each annotation category, for whether the
ranks of the genes in the category are evenly spread throughout the ranked list, or tend to occur
at the top or bottom of the list.
The GSEA test implemented here is that of [Tian et al., 2005]. The test implicitly calculates and
uses a standard t-test statistic for two-group experiments, and ANOVA statistic for multiple group
experiments for each feature, as measures of association. For each category, the test statistics
for the features in that category are summed and a category based test statistic is calculated
as this sum divided by the square root of the number of features in the category. Note that if a
feature has the value NaN in one of the samples, the t-test statistic for the feature will be NaN.
Consequently, the combined statistic for each of the categories in which the feature is included
will be NaN. Thus, it is advisable to filter out any feature that has a NaN value before applying
GSEA.
The p-values for the GSEA test statistics are calculated by permutation: The original test statistics
for the features are permuted and new test statistics are calculated for each category, based on
the permuted feature test statistics. This is done the number of times specified by the user in
the wizard. For each category, the lower and upper tail probabilities are calculated by comparing
the original category test statistics to the distribution of the permutation-based test statistics for
that category. The lower and higher tail probabilities are the number of these that are lower and
higher, respectively, than the observed value, divided by the number of permutations.
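The category statistic and its permutation-based tail probabilities, as described above, can be sketched as follows. The function names, the data layout (a list of per-feature test statistics and a list of indices for the category members) and the use of a seeded random generator are assumptions for illustration.

```python
import math
import random


def category_statistic(feature_stats, members):
    """Sum of the member features' statistics divided by sqrt(category size)."""
    return sum(feature_stats[i] for i in members) / math.sqrt(len(members))


def permutation_tails(feature_stats, members, permutations=100, seed=None):
    """Observed statistic plus lower and upper tail probabilities."""
    rng = random.Random(seed)
    observed = category_statistic(feature_stats, members)
    shuffled = list(feature_stats)
    lower = upper = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # permute the statistics over the features
        permuted = category_statistic(shuffled, members)
        if permuted < observed:
            lower += 1
        elif permuted > observed:
            upper += 1
    return observed, lower / permutations, upper / permutations
```

Because the tails are estimated from a finite number of random permutations, repeated runs without a fixed seed can give slightly different values.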
As the p-values are based on permutations you may sometimes see results where category x's
test statistic is lower than that of category y and the categories are of equal size, but where the
lower tail probability of category x is higher than that of category y. This is due to imprecision
in the estimations of the tail probabilities from the permutations. The higher the number of
permutations, the more stable the estimation.
You may run a GSEA on a full experiment, or on a sub-experiment where you have filtered away
features that you think are un-informative and represent only noise. Typically you will remove
features that are constant across samples (those for which the value in the 'Range' column is
zero; these will have a t-test statistic of zero) and/or those for which the interquartile range is
small. As the GSEA algorithm calculates and ranks genes on p-values from a test of differential
expression, it will generally not make sense to filter the experiment on p-values produced in an
analysis of differential expression, prior to running GSEA on it.
Tools | Expression Analysis | Annotation Test | Gene Set Enrichment
Analysis (GSEA)
Select an experiment and click Next.
Click Next. This will display the dialog shown in figure 25.52.
At the top, you select which annotation to use for testing. You can select from all the annotations
available on the experiment, but it is of course only a few that are biologically relevant. Once you
have selected an annotation, you will see the number of features carrying this annotation below.
In addition, you can set a filter: Minimum size required. Only categories with more genes (i.e.
features) than the specified number will be considered. Excluding categories with small numbers
of genes may lead to more robust results.
Annotations are typically given at the gene level. Often a gene is represented by more than one
feature in an experiment. If this is not taken into account it may lead to a biased result. The
standard way to deal with this is to reduce the set of features considered, so that each gene is
represented only once. Check the Remove duplicates check box to reduce the feature set, and
you can choose how you want this to be done:
Highest IQR. The feature with the highest interquartile range (IQR) is kept.
Highest value. The feature with the highest expression value is kept.
First you specify which annotation you want to use as gene identifier. Once you have selected this,
you will see the number of features carrying this annotation below. Next you specify which feature
you want to keep for each gene. This may be either the feature with the highest inter-quartile
range or the highest value.
Clicking Next will display the dialog shown in figure 25.53.
At the top, you can select which values to analyze (see section 25.2.1).
Below, you can set the Permutations for p-value calculation. For the GSEA test a p-value is
calculated by permutation: p permuted data sets are generated, each consisting of the original
features, but with the test statistics permuted. The GSEA test is run on each of the permuted
data sets. The test statistic is calculated on the original data, and the resulting value is compared
to the distribution of the values obtained for the permuted data sets. The permutation based
p-value is the number of permutation based test statistics above (or below) the value of the
test statistic for the original data, divided by the number of permuted data sets. For reliable
permutation-based p-value calculation a large number of permutations is required (100 is the
default).
Result of gene set enrichment analysis
The result of performing gene set enrichment analysis
using GO biological process is shown in figure 25.54.
Figure 25.54: The result of gene set enrichment analysis on GO biological process.
• Description. This is the description belonging to the category. Both of these are simply
extracted from the annotations.
• Size. The number of features with this category. (Note that this is after removal of
duplicates).
• Lower tail. This is the mass in the permutation based p-value distribution below the value
of the test statistic.
• Upper tail. This is the mass in the permutation based p-value distribution above the value
of the test statistic.
A small lower (or upper) tail p-value for an annotation category is an indication that features in
this category viewed as a whole are perturbed among the groups in the experiment considered.
GO terms are organized in a hierarchical structure. For example, the term "GO:0033151 V(D)J
recombination" from the Gene Ontology [Ashburner et al., 2000, The Gene Ontology Consortium,
2019] (https://geneontology.org/) is a descendant of "GO:0006259 DNA metabolic
process".
When testing for the significance of a particular GO term, all features linked to descendant GO
terms are included in the test. This can lead to a higher number of detected genes in the output
table, compared to the number of genes linked to the tested GO term.
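The effect of including descendant terms can be illustrated with a small traversal. This is a sketch under stated assumptions: the two GO IDs are the ones named above, but the direct parent-child link is a simplification of the real ontology, and the gene names and data structures are made up.

```python
def features_for_term(term, children, annotations):
    """Collect the features linked to `term` or to any of its descendant terms."""
    seen, stack, features = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        features.update(annotations.get(current, ()))
        stack.extend(children.get(current, ()))
    return features

# Toy data (hypothetical): the real ontology has intermediate terms between these two.
children = {"GO:0006259": ["GO:0033151"]}
annotations = {"GO:0006259": {"geneA"}, "GO:0033151": {"geneB", "geneC"}}
```

Here only geneA is linked directly to GO:0006259, but testing that term would include geneA, geneB and geneC.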
25.7.1 Histogram
A histogram shows a distribution of a set of values. Histograms are often used for examining
and comparing distributions, e.g. of expression values of different samples, in the quality control
step of an analysis. You can create a histogram showing the distribution of expression values for
a sample:
Tools | Expression Analysis | General Plots | Create Histogram
Select a number of samples or a graph track. When you have selected more
than one sample, a histogram will be created for each one. Clicking Next will display a dialog as
shown in figure 25.55.
Figure 25.55: Selecting which values the histogram should be based on.
In this dialog, you select the values to be used for creating the histogram (see section 25.2.1).
Click Finish to start the tool.
Viewing histograms
The resulting histogram is shown in figure 25.56.
The histogram shows the expression value on the x axis (in the case of figure 25.56 the
transformed expression values) and the counts of these values on the y axis.
In the Side Panel to the left, there are a number of options to adjust the view. Under Graph
preferences, you can adjust the general properties of the plot.
• Lock axes This will always show the axes even though the plot is zoomed to a detailed
level.
• Tick type Determine whether tick lines should be shown outside or inside the frame.
• Tick lines at Choosing Major ticks will show a grid behind the graph.
• Horizontal axis range Sets the range of the horizontal axis (x axis). Enter a value in Min
and Max, and press Enter. This will update the view. If you wait a few seconds without
pressing Enter, the view will also be updated.
• Vertical axis range Sets the range of the vertical axis (y axis). Enter a value in Min and
Max, and press Enter. This will update the view. If you wait a few seconds without pressing
Enter, the view will also be updated.
• Break points. Determines where the bars in the histogram should be:
Sturges method. This is the default. The number of bars is calculated from the range
of values by Sturges' formula [Sturges, 1926].
Equi-distanced bars. This will show bars from Start to End and with a width of Sep.
Number of bars. This will simply create a number of bars starting at the lowest value
and ending at the highest value.
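As an illustration of the default break point method, the textbook form of Sturges' formula gives the number of bars as ceil(log2(n)) + 1 for n values, with equal-width bars spanning the range of the values. The sketch below assumes that form; whether the Workbench rounds or places the bars in exactly this way is not stated here, and the function names are hypothetical.

```python
import math

def sturges_bar_count(n_values):
    """Textbook Sturges' formula: number of bars for n_values observations."""
    return math.ceil(math.log2(n_values)) + 1

def sturges_break_points(values):
    """Equal-width break points from the lowest to the highest value."""
    k = sturges_bar_count(len(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]
```

For example, 1000 expression values give 11 bars, while 8 values give 4.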
Below the graph preferences, you find Line color, which lets you choose between many different
colors. Click the color box to select a color.
Note that if you wish to use the same settings next time you open a histogram,
you need to save the settings of the Side Panel (see section 4.6).
Besides the histogram view itself, the histogram can also be shown in a table, summarizing key
properties of the expression values. An example is shown in figure 25.57.
25.7.2 MA plot
The MA plot is a scatter plot rotated by 45°. For two samples of expression values it plots for each
gene the difference in expression against the mean expression level. MA plots are often used for
quality control, in particular, to assess whether normalization and/or transformation is required.
You can create an MA plot comparing two samples:
Tools | Expression Analysis | General Plots | Create MA Plot
In the first two dialogs, select two samples: the first must be the case
expression data, and the second the control data. Clicking Next will display a dialog as shown in
figure 25.58.
In this dialog, you select the values to be used for creating the MA plot (see section 25.2.1).
Click Finish to start the tool.
Viewing MA plots
The resulting plot is shown in figure 25.59.
Figure 25.58: Selecting which values the MA plot should be based on.
The X axis shows the mean expression level of a feature on the two samples and the Y axis
shows the difference in expression levels for a feature on the two samples. From the plot shown
in figure 25.59 it is clear that the variance increases with the mean. With an MA plot like this,
you will often choose to transform the expression values (see section 25.2.2).
Figure 25.60 shows the same two samples where the MA plot has been created using log2
transformed values.
The much more symmetric and even spread indicates that the dependence of the variance on
the mean is not as strong as it was before transformation.
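The two axes of an MA plot can be computed as in the sketch below: for each feature, the mean of the two samples (X axis) and the case minus control difference (Y axis), optionally on log2 transformed values. The function name is hypothetical, and the plain log2 used here assumes strictly positive expression values; how the Workbench handles zeros during transformation is covered in section 25.2.2, not here.

```python
import math

def ma_values(case, control, log2_transform=False):
    """Return one (mean, difference) pair per feature for two samples."""
    points = []
    for x, y in zip(case, control):
        if log2_transform:
            x, y = math.log2(x), math.log2(y)  # assumes x > 0 and y > 0
        points.append(((x + y) / 2, x - y))
    return points
```

After log2 transformation, a twofold difference contributes the same Y value (1 or -1) regardless of the expression level, which is one reason the spread becomes more even.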
In the Side Panel to the left, there are a number of options to adjust the view. Under Graph
preferences, you can adjust the general properties of the plot.
• Lock axes This will always show the axes even though the plot is zoomed to a detailed
level.
• Tick type Determine whether tick lines should be shown outside or inside the frame.
• Tick lines at Choosing Major ticks will show a grid behind the graph.
• Horizontal axis range Sets the range of the horizontal axis (x axis). Enter a value in Min
and Max, and press Enter. This will update the view. If you wait a few seconds without
pressing Enter, the view will also be updated.
• Vertical axis range Sets the range of the vertical axis (y axis). Enter a value in Min and
Max, and press Enter. This will update the view. If you wait a few seconds without pressing
Enter, the view will also be updated.
• y = 0 axis. Draws a line where y = 0. Below there are some options to control the
appearance of the line:
Below the general preferences, you find the Dot properties preferences, where you can adjust
coloring and appearance of the dots:
• Dot type Can be None, Cross, Plus, Square, Diamond, Circle, Triangle, Reverse triangle, or
Dot.
Note that if you wish to use the same settings next time you open an MA plot, you need to
save the settings of the Side Panel (see section 4.6).
Figure 25.61: Selecting which values the scatter plot should be based on.
BLAST search
Contents
26.1 Running BLAST searches
26.1.1 BLAST at NCBI
26.1.2 BLAST against local data
26.2 Output from BLAST searches
26.2.1 Graphical overview for each query sequence
26.2.2 Overview BLAST table
26.2.3 BLAST graphics
26.2.4 BLAST HSP table
26.2.5 BLAST hit table
26.2.6 Extracting a consensus sequence from a BLAST result
26.3 Local BLAST databases
26.3.1 Download NCBI pre-formatted BLAST databases
26.3.2 Make pre-formatted BLAST databases available
26.3.3 Create local BLAST databases
26.4 Manage BLAST databases
26.5 Bioinformatics explained: BLAST
26.5.1 How does BLAST work?
26.5.2 Which BLAST program should I use?
26.5.3 Which BLAST options should I change?
26.5.4 Where can I get the BLAST+ programs
26.5.5 What you cannot get out of BLAST
26.5.6 Other useful resources
CLC Main Workbench can run BLAST searches on protein and DNA sequences. In short,
a BLAST search identifies sequences in a database that are homologous to your input (query)
sequence [McGinnis and Madden, 2004]. BLAST (Basic Local Alignment Search Tool) identifies
homologous sequences using a heuristic method that first finds short matches between two
sequences and then attempts to extend local alignments from these initial matches.
CHAPTER 26. BLAST SEARCH 620
If you are interested in the bioinformatics behind BLAST, there is an easy-to-read explanation of
this in section 26.5.
Figure 26.1 shows an example of a BLAST result in the CLC Main Workbench.
Figure 26.1: Display of the output of a BLAST search. At the top there is a graphical representation
of BLAST hits with tool-tips showing additional information on individual hits. Below is a tabular
form of the BLAST results.
In the first wizard step, select one or more sequences or sequence lists of the same type, DNA
or protein (figure 26.2).
Figure 26.2: Specify one or more query sequences or sequence lists for the BLAST search.
In the next wizard step, specify the type of BLAST search to run and the database to search
(figure 26.3). Only databases relevant to the selected search type will be listed.
Figure 26.3: Specify the type of search to run and the database to search.
• blastn: DNA sequence against a DNA database. Searches for DNA sequences with
homologous regions to your nucleotide query sequence.
• blastp: Protein sequence against Protein database. Used to look for peptide sequences
with homologous regions to your peptide query sequence.
• tblastn: Protein sequence against Translated DNA database. Peptide query sequences
are searched against an automatically translated, in six frames, DNA database.
Note: Hits found in the Protein Data Bank proteins (pdb) database can be downloaded and
opened with the 3D view.
In the following wizard step, the settings for the search can be refined (figure 26.4).
Figure 26.4: The settings for the BLAST search can be customized.
If blastx is selected as the program to use, an option for specifying the genetic code to use for
translating the query sequence will be available. If tblastx is selected, options for the genetic
code to use for translating the database and for translating the query sequences will be available.
BLAST search parameters are described below. See
https://blast.ncbi.nlm.nih.gov/doc/blast-topics/ for further details.
• Limit by Entrez query. BLAST searches can be limited to the results of an Entrez query
against the database chosen. This can be used to limit searches to subsets of entries in
the BLAST databases. Any terms can be entered that would normally be allowed in an
Entrez search session. More information about Entrez queries can be found at
https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options.
The syntax described there is the same as would be accepted in the CLC interface. Some
commonly used Entrez queries are pre-entered and can be chosen in the drop down menu.
• Mask low complexity regions. Mask off segments of the query sequence that have low
compositional complexity. Filtering can eliminate statistically significant, but biologically
uninteresting reports from the BLAST output (e.g. hits against common acidic-, basic- or
proline-rich regions), leaving the more biologically interesting regions of the query sequence
available for specific matching against database sequences.
• Expect. The threshold for reporting matches against database sequences. The Expect value
(E-value) describes the number of hits one can expect to see matching a query by chance
when searching against a database of a given size. If the E-value ascribed to a match is
greater than the value entered in the Expect field, the match will not be reported. Details
of how E-values are calculated can be found at the NCBI:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html. Lower thresholds are more stringent,
leading to fewer chance matches being reported. Increasing the threshold results in more
matches being reported, but many may match just by chance, not due to any biological
similarity. Values lower than 1 can be entered as decimals, or in scientific notation. For
example, 0.001, 1e-3 and 10e-4 would be equivalent and acceptable values.
• Word Size. BLAST is a heuristic that works by finding word-matches between the query
and database sequences. You may think of this process as finding "hot-spots" that BLAST
can then use to initiate extensions that might lead to full-blown alignments. For nucleotide-
nucleotide searches (i.e. "BLASTn") an exact match of the entire word is required before
an extension is initiated, so you normally regulate the sensitivity and speed of the
search by increasing or decreasing the word size. For other BLAST searches non-exact word
matches are taken into account based upon the similarity between words. The amount of
similarity can be varied, so you normally use just the word sizes 2 and 3 for these
searches.
• Gap Cost. The pull down menu shows the Gap Costs (penalty to open a gap and penalty to
extend a gap). Increasing the Gap Costs and Lambda ratio will result in alignments with
fewer gaps.
• Max number of hit sequences. The maximum number of database sequences, where
BLAST found matches to your query sequence, to be included in the BLAST report.
The parameters you choose will affect how long BLAST takes to run. A search of a small database
that requests only hits meeting stringent criteria will generally be quite quick. Searching large
databases, or allowing for very remote matches, will take longer.
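The behaviour of the Expect threshold described above can be sketched as follows: the value may be written as a decimal or in scientific notation, and matches whose E-value is greater than the threshold are not reported. The function names and the dictionary layout are assumptions made for this illustration only.

```python
def parse_expect(text):
    """Parse an Expect threshold given as a decimal or in scientific notation."""
    value = float(text)
    if value <= 0:
        raise ValueError("The Expect threshold must be positive")
    return value

def reported(matches, expect):
    """Keep the matches whose E-value is not greater than the Expect threshold."""
    return [m for m in matches if m["evalue"] <= expect]

# Hypothetical matches for illustration
matches = [
    {"hit": "hitA", "evalue": 2e-30},
    {"hit": "hitB", "evalue": 0.0005},
    {"hit": "hitC", "evalue": 0.2},
]
```

Calling `reported(matches, parse_expect("1e-3"))` keeps hitA and hitB, while a more permissive threshold of 10 would also report hitC.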
Click Finish to start the tool.
BLAST a partial sequence against NCBI You can search a database using only a part of a
sequence directly from the sequence view:
select the sequence region to send to BLAST | right-click the selection | BLAST
Selection Against NCBI
This will go directly to the dialog shown in figure 26.3 and the rest of the options are the same
as when performing a BLAST search with a full sequence.
• It can be faster.
On a technical level, CLC Main Workbench uses the NCBI's BLAST+ software (see
https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Thus, searching the same
database with a particular data set and the same search parameters gives the same
results, whether run locally or at the NCBI.
There are a number of options for what you can search against:
• You can create a database based on data already imported into your Workbench (see
section 26.3.3)
• You can add pre-formatted databases (see section 26.3.2)
• You can use sequence data from the Navigation Area directly, without creating a database
first.
Select one or more sequences of the same type (DNA or protein) and click Next.
This opens the dialog seen in figure 26.6:
At the top, you can choose between different BLAST programs.
BLAST programs for DNA query sequences:
• blastn: DNA sequence against a DNA database. Searches for DNA sequences with
homologous regions to your nucleotide query sequence.
• blastx: Translated DNA sequence against a Protein database. Automatic translation of
your DNA query sequence in six frames; these translated sequences are then used to
search a protein database.
• tblastx: Translated DNA sequence against a Translated DNA database. Automatic
translation of your DNA query sequence and the DNA database, in six frames. The resulting
peptide query sequences are used to search the resulting peptide database. Note that this
type of search is computationally intensive.
• blastp: Protein sequence against Protein database. Used to look for peptide sequences
with homologous regions to your peptide query sequence.
• tblastn: Protein sequence against Translated DNA database. Peptide query sequences
are searched against an automatically translated, in six frames, DNA database.
In cases where you have selected blastx or tblastx to conduct a search, you will get the option of
selecting a translation table for the genetic code. The standard genetic code is set as default.
This setting is particularly useful when working with organisms or organelles that have a genetic
code that differs from the standard genetic code.
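The choice between the five programs listed above depends only on the type of the query, the type of the database, and whether translation is involved. The lookup below summarizes that list; the function and key names are hypothetical and are not part of the Workbench.

```python
# (query type, database type, translation involved) -> BLAST program
PROGRAMS = {
    ("dna", "dna", False): "blastn",
    ("dna", "protein", True): "blastx",       # query translated in six frames
    ("dna", "dna", True): "tblastx",          # query and database both translated
    ("protein", "protein", False): "blastp",
    ("protein", "dna", True): "tblastn",      # database translated in six frames
}

def choose_program(query_type, database_type, translated=False):
    """Return the BLAST program for a query/database combination."""
    try:
        return PROGRAMS[(query_type, database_type, translated)]
    except KeyError:
        raise ValueError("No BLAST program for this combination") from None

def offers_translation_table(program):
    """A genetic code can be chosen when blastx or tblastx is selected."""
    return program in {"blastx", "tblastx"}
```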
If you search against the Protein Data Bank database and homologous sequences are found to
the query sequence, these can be downloaded and opened with the 3D Molecule Viewer (see
section 15.1.3).
You then specify the target database to use:
• Sequences. When you choose this option, you can use sequence data from the Navigation
Area as database by clicking the Browse and select icon. A temporary BLAST
database will be created from these sequences and used for the BLAST search. It is
deleted afterwards. If you want to be able to click in the BLAST result to retrieve the hit
sequences from the BLAST database at a later point, you should not use this option; create
a BLAST database first (see section 26.3.3).
• BLAST Database. Select a database already available in one of your designated BLAST
database folders. Read more in section 26.4.
The next dialog allows you to adjust the parameters to meet the requirements of your BLAST
search (figure 26.7).
Figure 26.7: Parameters that can be set before submitting a local BLAST search.
• Number of threads. Specify the number of threads to use. This is relevant if your
Workbench is installed on a multi-threaded system.
• Mask low complexity regions. Mask off segments of the query sequence that have low
compositional complexity. Filtering can eliminate statistically significant, but biologically
uninteresting reports from the BLAST output (e.g. hits against common acidic-, basic- or
proline-rich regions), leaving the more biologically interesting regions of the query sequence
available for specific matching against database sequences.
• Expect. The threshold for reporting matches against database sequences. The Expect value
(E-value) describes the number of hits one can expect to see matching a query by chance
when searching against a database of a given size. If the E-value ascribed to a match is
greater than the value entered in the Expect field, the match will not be reported. Details
of how E-values are calculated can be found at the NCBI:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html. Lower thresholds are more stringent,
leading to fewer chance matches being reported. Increasing the threshold results in more
matches being reported, but many may match just by chance, not due to any biological
similarity. Values lower than 1 can be entered as decimals, or in scientific notation. For
example, 0.001, 1e-3 and 10e-4 would be equivalent and acceptable values.
• Word Size. BLAST is a heuristic that works by finding word-matches between the query
and database sequences. You may think of this process as finding "hot-spots" that BLAST
can then use to initiate extensions that might lead to full-blown alignments. For nucleotide-
nucleotide searches (i.e. "BLASTn") an exact match of the entire word is required before
an extension is initiated, so you normally regulate the sensitivity and speed of the
search by increasing or decreasing the word size. For other BLAST searches non-exact word
matches are taken into account based upon the similarity between words. The amount of
similarity can be varied, so you normally use just the word sizes 2 and 3 for these
searches.
you are searching with (see the BLAST Frequently Asked Questions). Only applicable for
protein sequences or translated DNA sequences.
• Gap Cost. The pull down menu shows the Gap Costs (penalty to open a gap and penalty to
extend a gap). Increasing the Gap Costs and Lambda ratio will result in alignments with
fewer gaps.
• Max number of hit sequences. The maximum number of database sequences, where
BLAST found matches to your query sequence, to be included in the BLAST report.
• Filter out redundant results. This option culls HSPs on a per subject sequence basis by
removing HSPs that are completely enveloped by another HSP.
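The idea behind filtering out redundant results can be illustrated with a simplified cull: within each subject sequence, an HSP is dropped when its query range lies completely inside the range of another HSP. This is a sketch only. It ignores scores, keeps HSPs with identical ranges, and uses a made-up dictionary layout, so the rule applied by BLAST+ may differ in its details.

```python
def cull_enveloped(hsps):
    """Drop HSPs whose query range is completely enveloped by another HSP on the same subject."""
    kept = []
    for i, hsp in enumerate(hsps):
        enveloped = any(
            j != i
            and other["subject"] == hsp["subject"]
            and other["start"] <= hsp["start"]
            and hsp["end"] <= other["end"]
            and (other["start"], other["end"]) != (hsp["start"], hsp["end"])
            for j, other in enumerate(hsps)
        )
        if not enveloped:
            kept.append(hsp)
    return kept

# Hypothetical HSPs: B lies inside A on the same subject; C has the same range on another subject
hsps = [
    {"name": "A", "subject": "s1", "start": 10, "end": 100},
    {"name": "B", "subject": "s1", "start": 20, "end": 50},
    {"name": "C", "subject": "s2", "start": 20, "end": 50},
    {"name": "D", "subject": "s1", "start": 90, "end": 120},
]
```

In this example only B is removed: C belongs to a different subject sequence, and D merely overlaps A without being enveloped by it.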
BLAST a partial sequence against a local database You can search a database using only a
part of a sequence directly from the sequence view:
select the region that you wish to BLAST | right-click the selection | BLAST
Selection Against Local Database
This will go directly to the dialog shown in figure 26.6 and the rest of the options are the same
as when performing a BLAST search with a full sequence.
Figure 26.8: Default display of the output of a BLAST search for one query sequence. At the top
there is a graphical representation of BLAST hits with tooltips showing additional information on
individual hits.
Figure 26.9: An overview BLAST table summarizing the results for a number of query sequences.
Double-clicking a row will open the BLAST result for this query sequence, allowing more detailed
investigation of the result. You can also select one or more rows and click the Open BLAST
Output button at the bottom of the view. A consensus sequence can be extracted by clicking
the Extract Consensus button at the bottom. Clicking the Open Query Sequence button will open a
sequence list with the selected query sequences. This can be useful in workflows where BLAST
is used as a filtering mechanism: you can filter the table to include e.g. sequences that
have a certain top hit and then extract those.
In the overview table, the following information is shown:
• Query: Since this table displays information about several query sequences, the first column
is the name of the query sequence.
• Number of HSPs: The number of High-scoring Segment Pairs (HSPs) for this query sequence.
• For the following list, the value of the best HSP is displayed together with accession number
and description of this HSP, with respect to E-value, identity or positive value, hit length or
bit score.
Lowest E-value
Accession (E-value)
Description (E-value)
Greatest identity %
Accession (identity %)
Description (identity %)
Greatest positive %
Accession (positive %)
Description (positive %)
Greatest HSP length
Accession (HSP length)
Description (HSP length)
Greatest bit score
Accession (bit score)
Description (bit score)
If you wish to save some of the BLAST results as individual elements in the Navigation Area,
open them and click Save As in the File menu.
• Blast layout. You can control the level of Compactness for displaying sequences:
You can also choose to Gather sequences at top. Enabling this option affects the view that
is shown when scrolling horizontally along a BLAST result. If selected, the sequence hits
which did not contribute to the visible part of the BLAST graphics will be omitted whereas
the found BLAST hits will automatically be placed right below the query sequence.
• BLAST hit coloring. You can choose whether to color hit sequences and adjust the coloring
scale for visualisation of identity level.
The remaining View preferences for BLAST Graphics are the same as those of alignments.
See section 14.2.
Some of the information available in the tooltips when hovering over a particular hit sequence is:
• Name of sequence. Shows additional information about the sequence that
was found. This line corresponds to the description line in GenBank (if the search was
conducted on the nr database).
• Score. This shows the bit score of the local alignment generated through the BLAST search.
• Expect. Also known as the E-value. A low value indicates a homologous sequence. Higher
E-values indicate that BLAST found a less homologous sequence.
• Identities. This number shows the number of identical residues or nucleotides in the
obtained alignment.
• Gaps. This number shows whether the alignment has gaps or not.
• Strand. This is only valid for nucleotide sequences and shows the direction of the aligned
strands. Minus indicates a complementary strand.
The numbers of the query and subject sequences refer to the sequence positions in the submitted
and found sequences. If the subject sequence has number 59 in front of the sequence, this
means that 58 residues are found upstream of this position, but these are not included in the
alignment.
By right-clicking the sequence name in the Graphical BLAST output it is possible to download the
full hit sequence from NCBI with accompanying annotations and information. It is also possible
to just open the actual hit sequence in a new view.
Figure 26.10: BLAST HSP Table. The HSPs can be sorted by the different columns, simply by
clicking the column heading.
• Query sequence. The sequence which was used for the search.
• E-value. Measure of quality of the match. Higher E-values indicate that BLAST found a less
homologous sequence.
• Score. This shows the score of the local alignment generated through the BLAST search.
• Bit score. This shows the bit score of the local alignment generated through the BLAST
search. Bit scores are normalized, which means that the bit scores from different alignments
can be compared, even if different scoring matrices have been used.
• Overlap. Displays a percentage value for the overlap of the query sequence and HSP
sequence. Only the length of the local alignment is taken into account and not the full
length query sequence.
• Identity. Shows the number of identical residues in the query and HSP sequence.
• %Identity. Shows the percentage of identical residues in the query and HSP sequence.
• Positive. Shows the number of similar but not necessarily identical residues in the query
and HSP sequence.
• %Positive. Shows the percentage of similar but not necessarily identical residues in the
query and HSP sequence.
• Gaps. Shows the number of gaps in the query and HSP sequence.
• %Gaps. Shows the percentage of gaps in the query and HSP sequence.
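The relation between the Score, Bit score and E-value columns can be illustrated with the standard Karlin-Altschul formulas covered in the NCBI tutorial cited earlier: the bit score is S' = (lambda * S - ln K) / ln 2, and the expected number of chance hits is E = m * n * 2^(-S') for a query of length m and a database of n residues. The sketch below is not taken from the Workbench; the lambda and K values in the usage note are illustrative assumptions that depend on the scoring matrix and gap costs.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score using the statistical parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bits)
```

With the illustrative values lambda = 0.267 and K = 0.041, a raw score of 100 corresponds to a bit score of roughly 43. Because lambda and K are folded into the bit score, bit scores obtained with different scoring matrices can be compared directly.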
In the BLAST table view you can handle the HSP sequences. Select one or more sequences from
the table, and apply one of the following functions.
• Download and Open. Downloads the full sequence from NCBI and opens it. If multiple
sequences are selected, they will all open (if the same sequence is listed several times,
only one copy of the sequence is downloaded and opened).
• Download and Save. Downloads the full sequence from NCBI and saves it. When you click
the button, there will be a save dialog letting you specify a folder to save the sequences. If
multiple sequences are selected, they will all be saved (if the same sequence is listed several
times, only one copy of the sequence is downloaded and saved).
• Open at NCBI. Opens the corresponding sequence(s) at GenBank at NCBI, where
additional information regarding the selected sequence(s) is stored. The default Internet browser is
used for this purpose.
• Open structure. If the HSP sequence contains structure information, the sequence is
opened in a text view or a 3D view. Note that the 3D view has special system requirements,
see section 1.3.
The HSPs can be sorted by the different columns, simply by clicking the column heading. In cases
where individual rows have been selected in the table, the selected rows will still be selected
after sorting the data.
You can do a text-based search in the information in the BLAST table by using the filter at the
upper right part of the view. In this way you can search for e.g. species or other information which
is typically included in the "Description" field.
The table is integrated with the graphical view described in section 26.2.3 so that selecting an
HSP in the table will make a selection on the corresponding sequence in the graphical view.
Figure 26.11: BLAST Hit Table. The hits can be sorted by the different columns, simply by clicking
the column heading.
• Query sequence. The sequence which was used for the search.
• Hit. The name of the sequence found in the BLAST search.
• Id. GenBank ID.
• Description. Text from NCBI describing the sequence.
• Total Score. Total score for all HSPs.
• Max Score. Maximum score of all HSPs.
• Min E-value. Minimum E-value of all HSPs.
• Max Bit score. Maximum Bit score of all HSPs.
• Max Identity. Shows the maximum number of identical residues in the query and Hit
sequence.
• Max %Identity. Shows the maximum percentage of identical residues in the query and Hit
sequence.
• Max Positive. Shows the maximum number of similar but not necessarily identical residues
in the query and Hit sequence.
• Max %Positive. Shows the maximum percentage of similar but not necessarily identical
residues in the query and Hit sequence.
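The hit table columns are per-hit summaries of the HSP table. The sketch below shows one way such a summary could be derived; it assumes that Total Score is the sum of the HSP scores, and the field names and example values are made up for illustration.

```python
from collections import defaultdict

def summarize_hits(hsps):
    """Aggregate HSP rows into one summary row per hit sequence."""
    grouped = defaultdict(list)
    for hsp in hsps:
        grouped[hsp["hit"]].append(hsp)
    return {
        hit: {
            "total_score": sum(h["score"] for h in group),  # assumption: sum over all HSPs
            "max_score": max(h["score"] for h in group),
            "min_evalue": min(h["evalue"] for h in group),
            "max_bit_score": max(h["bit_score"] for h in group),
        }
        for hit, group in grouped.items()
    }

# Hypothetical HSP rows
hsps = [
    {"hit": "X", "score": 50, "evalue": 1e-10, "bit_score": 45.0},
    {"hit": "X", "score": 30, "evalue": 1e-3, "bit_score": 28.0},
    {"hit": "Y", "score": 20, "evalue": 0.5, "bit_score": 18.0},
]
```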
• Download pre-formatted BLAST databases from the NCBI using Download BLAST Databases
(see section 26.3.1).
• Specify locations where BLAST databases are stored at your site using Manage BLAST
Databases (see section 26.4).
• Use Create BLAST Database to create databases using sequences or sequence lists
selected from the Workbench Navigation Area (see section 26.3.3).
For BLAST searches against a small amount of sequence data, a sequence list can be specified
instead of a database when launching the BLAST tool (see section 26.1.2). A database for those
sequences will be created as part of that job. This adds to the overall execution time, so if those
sequences will be used for multiple searches, creating a BLAST database and referring to that
when launching searches is likely to be preferable.
Figure 26.12: Choose from pre-formatted BLAST databases at the NCBI available for download.
If there is more than one BLAST database location configured, you will be able to specify which
one to store the BLAST database files in. See section 26.4 for details about adding BLAST
database locations.
• Add the location where BLAST database files are stored as a BLAST database location (see
section 26.4).
OR
• Move the files that make up the BLAST database to a location already configured as a
BLAST database location. All the files that comprise a given BLAST database must be
moved. This may be as few as three files, but can be more (figure 26.13).
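When moving a database by hand, it helps to collect every file belonging to it before moving anything. The sketch below assumes that all files of a database share the database name followed by a dot (for example `mydb.nhr`, `mydb.nin` and `mydb.nsq` for a nucleotide database); the function names are hypothetical, and a database whose name is a prefix of another database's name would need a stricter match.

```python
import shutil
from pathlib import Path

def database_files(folder, name):
    """List the files in `folder` whose names start with the database name and a dot."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.name.startswith(name + ".")
    )

def move_database(source, name, destination):
    """Move all files of one database to another folder and return their names."""
    files = database_files(source, name)
    for path in files:
        shutil.move(str(path), str(Path(destination) / path.name))
    return [p.name for p in files]
```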
Figure 26.13: BLAST databases are made up of several files. The exact number varies. Large
databases will be split into a number of volumes and there will be several files per volume.
After selecting the sequences or sequence lists to include in your database and clicking on Next,
you provide information about the BLAST database being made (figure 26.15):
• Name. The name of the BLAST database. This name will be used when running BLAST
searches and also as the base file name for the BLAST database files.
• Description. A short description. This is displayed along with the database name in the list
of available databases when launching a local BLAST search. If no description is entered,
the creation date is used as the description.
• Location. The BLAST database location to save the BLAST database files to.
Click Finish to create the BLAST database. Once the process is complete, the new database
will be available in the Manage BLAST Databases dialog (see section 26.4) and when launching
BLAST jobs (see section 26.1.2).
Figure 26.15: Provide information about the BLAST database being created and specify where the
files should be saved to.
Create BLAST Database creates BLAST+ version 4 (dbV4) databases.
Figure 26.16: BLAST databases are listed and can be managed using Manage BLAST Databases.
At the top of the dialog, there is a list of the BLAST database locations. These locations are
folders where the Workbench will look for valid BLAST databases. These can either be created
from within the Workbench using the Create BLAST Database tool, see section 26.3.3, or they
can be pre-formatted BLAST databases.
The list of locations can be modified using the Add Location and Remove Location buttons.
Once the Workbench has scanned the locations, it will keep a cache of the databases (in order
to improve performance). If you have added new databases that are not listed, you can press
Refresh Locations to clear the cache and search the database locations again.
By default a BLAST database location will be added under your home area in a folder called
CLCdatabases. This folder is scanned recursively, through all subfolders, to look for valid
databases. All other folder locations are scanned only at the top level.
Below the list of locations, all the BLAST databases are listed with the following information:
• Total size (1000 residues). The number of residues in the database, either bases or amino
acids.
Below the list of BLAST databases, there is a button to Remove Database. This option will delete
the database files belonging to the database selected.
Searching for homology Most research projects involving sequencing of either DNA or protein
have a requirement for obtaining biological information of the newly sequenced and maybe
unknown sequence. If the researchers have no prior information of the sequence and biological
content, valuable information can often be obtained using BLAST. The BLAST algorithm will search
for homologous sequences in predefined and annotated databases of the user's choice.
In an easy and fast way the researcher can gain knowledge of gene or protein function and find
evolutionary relations between the newly sequenced DNA and well established data.
A BLAST search generates a report specifying the potentially homologous sequences found and
their local alignments with the query sequence.
Seeding When finding a match between a query sequence and a hit sequence, the starting
point is the words that the two sequences have in common. A word is simply defined as a number
of letters. For blastp the default word size is 3 (W=3). If a query sequence contains QWRTG, the
searched words are QWR, WRT and RTG. See figure 26.17 for an illustration of words in a protein
sequence.
Figure 26.17: Generation of exact BLAST words with a word size of W=3.
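The word generation described above can be sketched in a few lines of Python. This is only an illustration of the principle, not the code used by BLAST or the Workbench; the function name blast_words is a hypothetical name chosen for this example.

```python
def blast_words(sequence: str, word_size: int = 3) -> list[str]:
    """Return every overlapping word of length word_size found in sequence."""
    # A sequence shorter than the word size yields no words.
    last_start = len(sequence) - word_size + 1
    return [sequence[i:i + word_size] for i in range(last_start)]


# The example from the text: a query containing QWRTG with W=3.
print(blast_words("QWRTG"))  # ['QWR', 'WRT', 'RTG']
```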
During the initial BLAST seeding, the algorithm finds all common words between the query
sequence and the hit sequence(s). Only regions with a word hit will be used to build an
alignment.
BLAST will start out by making words for the entire query sequence (see figure 26.17). For each
word in the query sequence, a compilation of neighborhood words that reach the threshold T
is also generated.
A neighborhood word is a word that obtains a score of at least T when compared with the query
word using a selected scoring matrix (see figure 26.18). The default scoring matrix for blastp is BLOSUM62. The
compilation of exact words and neighborhood words is then used to match against the database
sequences.
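As a rough illustration of how neighborhood words could be compiled, the sketch below enumerates every word of the same length as the query word and keeps those scoring at least T. The scoring function toy_score is a deliberately simple, hypothetical stand-in and is not BLOSUM62. The function names and the threshold value are likewise assumptions made for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a: str, b: str) -> int:
    # Hypothetical scoring for illustration only: NOT the BLOSUM62 matrix.
    return 4 if a == b else -1


def neighborhood_words(word, threshold, score=toy_score, alphabet=AMINO_ACIDS):
    """Return all words of len(word) scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


# With the toy scoring, an exact match scores 12 and one mismatch scores 7.
words = neighborhood_words("QWR", threshold=7)
print(len(words))
```

Raising the threshold in this sketch shrinks the list of neighborhood words, which mirrors the effect of increasing T described in this section.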
After the initial finding of words (seeding), the BLAST algorithm will extend the (only 3 residues
long) alignment in both directions (see figure 26.19). Each time the alignment is extended, the
alignment score is increased or decreased. When the alignment score drops below a predefined
threshold, the extension of the alignment stops. This ensures that the alignment is not extended
to regions where only very poor alignment between the query and hit sequence is possible. If
the obtained alignment receives a score above a certain threshold, it will be included in the final
BLAST result.
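The extension step can be sketched as follows. This is a simplified, ungapped illustration and not the actual BLAST implementation. It assumes that "drops below a predefined threshold" means the running score has fallen more than drop_off below the best score seen so far in that direction. The scoring function simple_score and all names are hypothetical.

```python
def simple_score(a: str, b: str) -> int:
    # Hypothetical scoring for illustration only (not a real substitution matrix).
    return 2 if a == b else -2


def extend_seed(query, subject, q_start, s_start, word_size, drop_off, score=simple_score):
    """Extend a word hit without gaps in both directions.

    Returns (total_score, q_begin, q_end), where q_end is exclusive.
    """
    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(word_size))

    def walk(q_pos, s_pos, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > drop_off:
                break  # score has dropped too far: stop extending
            q_pos += step
            s_pos += step
        return best, best_len

    right_score, right_len = walk(q_start + word_size, s_start + word_size, 1)
    left_score, left_len = walk(q_start - 1, s_start - 1, -1)
    return seed + left_score + right_score, q_start - left_len, q_start + word_size + right_len


total, begin, end = extend_seed("MKQWRTGLL", "PPQWRTGPP", 2, 2, word_size=3, drop_off=3)
print(total, "MKQWRTGLL"[begin:end])  # 10 QWRTG
```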
Figure 26.18: Neighborhood BLAST words based on the BLOSUM62 matrix. Only words scoring
above the threshold (T=13) are included in the initial seeding.
Figure 26.19: BLAST aligning in both directions. The initial word match is marked green.
By tweaking the word size W and the neighborhood word threshold T, it is possible to limit the
search space. E.g. by increasing T, the number of neighboring words will drop and thus limit the
search space as shown in figure 26.20.
Figure 26.20: Each dot represents a word match. Increasing the threshold of T limits the search
space significantly.
This will increase the speed of BLAST significantly but may result in loss of sensitivity. Increasing
the word size W will also increase the speed but again with a loss of sensitivity.
The E-value The expect value (E-value) describes the number of hits one can expect to see
matching the query by chance when searching against a database of a given size. An E-value of
1 can be interpreted as meaning that, in a search like the one just run, you could expect to see
one match with a similar score purely by chance, that is, a match that is not homologous to the
query sequence. When looking for very similar sequences in a database, it is often beneficial to
use very low E-value thresholds.
E-values depend on the query sequence length and the database size. Short identical sequences
may have high E-values and may be regarded as "false positive" hits. This is often seen if one
searches for short primer regions, small domain regions etc. Below are some comments on what
one could infer from results with E-values in particular ranges.
• E-value < 10e-100 Identical sequences. You will get long alignments across the entire
query and hit sequence.
• 10e-100 < E-value < 10e-50 Almost identical sequences. A long stretch of the query
matches the hit sequence.
• 10e-50 < E-value < 10e-10 Closely related sequences, could be a domain match or similar.
• 10e-10 < E-value < 1 Could be a true homolog, but it is a gray area.
• E-value > 10 Hits are most likely not related unless the query sequence is very short.
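The dependence of the E-value on query length and database size can be illustrated with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length, n the database size and S the raw alignment score. The sketch below is a simplified illustration: the default values for K and lambda are example values only, since the real parameters depend on the scoring matrix and gap costs, and BLAST itself applies further corrections such as effective lengths. The function name is hypothetical.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.134, lam=0.3176):
    """Approximate E-value: K * m * n * exp(-lambda * S).

    k and lam are illustrative defaults, not values taken from the Workbench.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# The same score is less significant in a larger database ...
small_db = expect_value(raw_score=50, query_len=100, db_len=1_000_000)
large_db = expect_value(raw_score=50, query_len=100, db_len=2_000_000)
print(large_db / small_db)

# ... and a higher score gives a lower E-value.
print(expect_value(80, 100, 1_000_000) < small_db)
```

This also shows why short queries tend to give higher E-values: a short sequence can only reach a modest score, even when it matches perfectly.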
Gap costs For blastp it is possible to specify gap cost for the chosen substitution matrix. There
is only a limited number of options for these parameters. The open gap cost is the price of
introducing gaps in the alignment, and extension gap cost is the price of every extension past the
initial opening gap. Increasing the gap costs will result in alignments with fewer gaps.
Filters It is possible to set different filter options before running a BLAST search. Low-complexity
regions have a very simple composition compared to the rest of the sequence and may result in
problems during the BLAST search [Wootton and Federhen, 1993]. A low complexity region of a
protein can for example look like this 'fftfflllsss', which in this case is a region as part of a signal
peptide. In the output of the BLAST search, low-complexity regions will be marked in lowercase
gray characters (default setting). The low complexity region cannot be thought of as a significant
match; thus, disabling the low complexity filter is likely to generate more hits to sequences which
are not truly related.
Word size Changing the word size has a great impact on the seeded sequence space as
described above. But one can change the word size to find sequence matches which would
otherwise not be found using the default parameters. For instance the word size can be
decreased when searching for primers or short nucleotide sequences. For blastn a suitable setting would
be to decrease the default word size of 11 to 7, increase the E-value significantly (1000) and
turn off the complexity filtering.
For blastp a similar approach can be used. Decrease the word size to 2, increase the E-value
and use a more stringent substitution matrix, e.g. a PAM30 matrix.
The BLAST search programs at the NCBI adjust settings automatically when short sequences are
being used for searches, and there is a dedicated page, Primer-BLAST, for searching for primer
sequences (see https://blast.ncbi.nlm.nih.gov/Blast.cgi).
Substitution matrix For protein BLAST searches, a default substitution matrix is provided. If
you are looking at distantly related proteins, you should either choose a high-numbered PAM
matrix or a low-numbered BLOSUM matrix. The default scoring matrix for blastp is BLOSUM62.
A few commercial software packages are available for searching your own data. The advantage
of using a commercial program is obvious when BLAST is integrated with the existing tools of
these programs. Furthermore, they let you perform BLAST searches and retain annotations on
the query sequence (see figure 26.21). It is also much easier to batch download a selection of
hit sequences for further inspection.
Figure 26.21: Snippet of alignment view of BLAST results. Individual alignments are represented
directly in a graphical view. The top sequence is the query sequence and is shown with a selection
of annotations.
Utility tools
Contents
27.1 Extract Annotated Regions
27.2 Combine Reports
27.2.1 Combine Reports output
27.3 Create Report from Table
27.4 Modify Report Type
27.4.1 Modifying report types in workflows
27.5 Create Sequence List
27.6 Update Sequence Attributes in Lists
27.7 Split Sequence List
27.8 Rename Elements
27.9 Rename Sequences in Lists
CHAPTER 27. UTILITY TOOLS 645
Note: With the CLC Main Workbench, annotated sequences are expected as input. Options
relating to track-based input are intended for the CLC Genomics Workbench, where this tool is
also present.
Figure 27.1: Select one or more sequences to extract annotated regions from.
At the top of the next dialog step (figure 27.2) you can specify which annotations to use.
• Search terms All annotations and attached information for each annotation will be searched
for the entered term. It can be used to make general searches for search terms such as
"Gene" or "Exon", or it can be used to make more specific searches. For example, if you
have a gene annotation called "MLH1" and another called "MLH3", you can extract both
annotations by entering "MLH" in the search term field. If you wish to enter more specific
search terms, separate them with commas: "MLH1, Human" will find annotations where
both "MLH1" and "Human" are included.
• Annotation types If only certain types of annotations should be extracted, this can be
specified here.
• Flanking upstream residues The output will include this number of extra residues at the 5'
end of the annotation.
• Flanking downstream residues The output will include this number of extra residues at the
3' end of the annotation.
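The search term and flanking options above can be sketched as follows. This is only an illustration of the described behavior, not the Workbench implementation. It assumes case-insensitive matching, 0-based half-open coordinates, forward-strand annotations, and flanks that are clipped at the ends of the sequence. All function names are hypothetical.

```python
def matches_search(annotation_text: str, search_field: str) -> bool:
    """True if every comma-separated term occurs in the annotation text."""
    terms = [t.strip().lower() for t in search_field.split(",") if t.strip()]
    text = annotation_text.lower()
    return all(term in text for term in terms)


def region_with_flanks(start, end, upstream, downstream, seq_len):
    """Extend a forward-strand region by flanking residues, clipped to the sequence."""
    return max(0, start - upstream), min(seq_len, end + downstream)


# "MLH" finds both MLH1 and MLH3; "MLH1, Human" requires both terms.
print(matches_search("MLH3 gene", "MLH"))            # True
print(matches_search("MLH3 Human", "MLH1, Human"))   # False
print(region_with_flanks(100, 200, 10, 20, 1000))    # (90, 220)
```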
The sequences that are created can be named after the annotation name, type, etc:
• Include annotation name This will use the name of the annotation in the name of the
extracted sequence.
• Include annotation type This corresponds to the type chosen above and will put this
information in the name of the resulting sequences. This is useful information if you have
chosen to extract "All" types of annotations.
• Include annotation region The region covered by the annotation on the original sequence
(i.e. not including flanking regions) will be included in the name.
• Include sequence/track name If you have selected more than one sequence as input, this
option enables you to discern the origin of the resulting sequences in the list by putting the
name of the original sequence into the name of the resulting sequences.
• Order of inputs Use the order that the input reports were specified in.
• Define order Explicitly define the section order by moving items up and down in the listing.
Defining the order is recommended when the tool is being launched in batch mode with
folders of reports provided as the batch units. Doing this avoids reliance on the order of
the elements within the folders being the same.
When Combine Reports is included in a workflow, sections are ordered according to the order of
the inputs. See section 13.1.3 for information about ordering inputs in workflows.
Figure 27.3: Clicking on the info icon at the top right corner of the input selection wizard opens a
window showing a list of tools that produce supported reports. Text entered in the field at the top
limits the list to just tools with names containing the search term.
Figure 27.4: When more than one report type is provided as input, the order of the sections can be
configured in the "Set order" wizard step.
Where available, individual subsections and summary items can also be specified. Where
only some subsections or summary items are excluded, the checkboxes for the parent
section(s) are highlighted for visibility.
Figure 27.5: The content of the combined report is configured in the "Set contents" wizard step.
Sections with a check in the box are included, while those without a check are excluded from the
combined report. For visibility, sections where some contents have been excluded have checkboxes
highlighted.
Reusing configurations
Configurations defined previously can be used in subsequent runs.
Configurations can be copied in two ways:
• Copy the configuration defined in the relevant wizard step using the Copy all button.
• Copy the configuration used in previous runs of the tool from the History view of the
output, described further below.
Copied configurations can be pasted into a text file for later use.
A copied configuration can be pasted into the wizard step using the Paste button.
Any existing settings in that wizard step will be overwritten.
The history of a report output by Combine Reports contains both the order of the sections (Order
reports) and the excluded sections/subsections/summary items (Exclude) (figure 27.6). These
can be selected, copied, and then pasted into the "Set order"/"Set contents" wizard steps,
respectively, in a subsequent run. Alternatively, the entire history can be selected, copied, and
then pasted in each wizard step. Only the relevant configuration is pasted into each step.
Figure 27.6: The history of a report output by Combine Reports with the parameters selected, ready
to be copied.
• Reports with the same type that are generated by the same tool are summarized into a
single section, named according to the type.
This is useful when the aim is to compare the values from those reports, for example
results from different samples or different analysis runs. However, if a particular tool has
been used more than once in an analysis, for different purposes, then placing the summary
of these results in different sections may be desirable. This can be done by editing the
report type in some of the reports (see section 27.4).
• Reports with different types are summarized in separate sections, named according to the
types.
The report type assigned by a particular tool is unique, so reports generated by different
tools have different types.
If reports generated by different tools are later modified so their report types are the same,
those reports will still be summarized in different sections, although each of these sections
will have the same name.
The type of a report can be seen in the Element Info view for that report.
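The grouping rules above can be summarized in a short sketch: reports are placed in the same section only when both the originating tool and the report type match, and the section is named after the type. This is an illustration of the described behavior, not the Workbench implementation. The function name and the tool names "Tool A" and "Tool B" are hypothetical.

```python
from collections import defaultdict


def group_into_sections(reports):
    """reports: iterable of (report_name, originating_tool, report_type).

    Returns a list of (section_name, [report_names]) in first-seen order.
    """
    sections = defaultdict(list)
    for name, tool, report_type in reports:
        # Types are compared exactly, so differences in case give separate sections.
        sections[(tool, report_type)].append(name)
    return [(report_type, names) for (_tool, report_type), names in sections.items()]


reports = [
    ("sample1", "Tool A", "Trim by Quality"),
    ("sample2", "Tool A", "Trim by Quality"),
    ("sample3", "Tool A", "Trim by Ambiguous"),
    ("sample4", "Tool B", "Trim by Quality"),  # type modified to match Tool A's
]
for section_name, names in group_into_sections(reports):
    print(section_name, names)
```

In this example the reports from "Tool B" end up in their own section, even though that section has the same name as the first one.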
Figure 27.7: The type of a report can be found in the Element Info view of reports that are
supported as input for tools that summarize reports.
The report contains one section per input report type, as described in section 27.2. Summary
items are displayed in table format.
Note: The summaries for reports produced by Trim Sequences do not follow the format described
below.
The tables contain one row per input report and one column per summary item. The last rows,
shaded in pale gray, report the minimum, median, maximum, mean and standard deviation for
all numeric summary items (figure 27.8).
The first column indicates the sample name, i.e. the name of the input report. The combined report
contains links to the input reports and clicking on the sample name selects the corresponding
report in the Navigation Area.
Highlighted cells
Table cells are highlighted in yellow if they are detected as outliers (figure 27.8). For each numeric
summary item, the lower quartile - 1.5 IQR (interquartile range) to upper quartile + 1.5 IQR range
is calculated using all the values for the summary item. Samples with values outside this range
are considered outliers.
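The outlier rule can be sketched as shown below. The manual does not state how the quartiles are computed, so the use of the "inclusive" method of Python's statistics.quantiles is an assumption, and results may differ slightly from those of the Workbench for small numbers of samples. The function name is hypothetical.

```python
import statistics


def find_outliers(values_by_sample):
    """Return the sample names whose value lies outside
    [lower quartile - 1.5 IQR, upper quartile + 1.5 IQR]."""
    values = list(values_by_sample.values())
    # Assumption: inclusive quartile method; the Workbench may use another.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {name for name, value in values_by_sample.items() if value < low or value > high}


summary_item = {"s1": 10, "s2": 11, "s3": 12, "s4": 13, "s5": 100}
print(find_outliers(summary_item))  # {'s5'}
```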
Summary section
By default, combined reports contain a summary section, offering a quick overview of samples
that have been identified as outliers and/or problematic. The summary section is only present
if it was included when configuring the report content (see section 27.2) and it only contains
summaries of those sections/subsections/summary items that are also included in the combined
report.
Figure 27.8: Summary items are reported in tables. Cells are highlighted in yellow when identified
as outliers.
• variant track
• annotation track
• expression track
Note that tables with many rows will create very long reports. Consider filtering the table first.
Filtering can be done manually when in table view, see section 9.2.
To run Create Report from Table, go to:
Tools | Utility Tools | Reports | Create Report from Table
After selecting the input element, the columns to include in the report must be defined. Column
definitions consist of four parts:
• New name. The name this column should have in the report. Shorter names are often
preferred in reports, because PDF exports have limited width. If the name is left blank, the
column will not be renamed.
• Sort. Whether the column in the table should be sorted in ascending or descending order.
When left blank, the sorting will match that of the input table.
• Sort order. The order in which sorting should be applied (only relevant when sorting on
multiple columns). Sorting is applied to columns in order from smallest to largest sort order
i.e., column with sort order 1 is sorted on before column with sort order 2, and so on. It is
not possible to manually enter a sort order. Instead the order is populated automatically
according to the order in which columns are chosen for sorting. To change an existing sort
order, toggle sorting of the affected columns off and on, such that they receive new sort
orders.
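Sorting on multiple columns according to sort order can be sketched as follows. This illustrates the described behavior only. The row representation (a list of dictionaries) and the function name are assumptions made for the example.

```python
def sort_rows(rows, sort_spec):
    """Sort rows on several columns.

    sort_spec: list of (column, direction, sort_order), where direction is
    "ascending" or "descending" and sort_order 1 has the highest priority.
    """
    result = list(rows)
    # Python's sort is stable, so sorting on the lowest-priority column first
    # and the highest-priority column last gives the combined ordering.
    for column, direction, _order in sorted(sort_spec, key=lambda s: s[2], reverse=True):
        result.sort(key=lambda row: row[column], reverse=(direction == "descending"))
    return result


rows = [
    {"gene": "B", "count": 5},
    {"gene": "A", "count": 5},
    {"gene": "C", "count": 9},
]
spec = [("count", "descending", 1), ("gene", "ascending", 2)]
print([row["gene"] for row in sort_rows(rows, spec)])  # ['C', 'A', 'B']
```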
To make defining columns easier, the Load Attributes button can be used to populate a dropdown
list of the columns found in the element selected in the Template element field (figure 27.9). By
default, the tool's input is preselected. Use the browse button to select a different element.
If a template element is not used for populating a dropdown list, columns can be entered by
typing directly in the Column field.
The Add button adds additional columns while pressing the X button to the right of a column
removes it. It is possible to reorder columns using the Up and Down buttons. The Clear button
removes all defined columns.
If a column is defined that is not present in the input element, then an empty column with that
name will be placed in the report.
Combined reports
Report sections from this tool cannot be used in Combine Reports, because the contents of the
tables may be specific to each sample.
tool can be used. Both options are described in this section. Note that report types are case
sensitive. E.g. 'Trim by Quality' and 'Trim by quality' are interpreted as different types.
The report type assigned by a particular tool is unique, so reports generated by different tools
have different types. The term "(default)" at the end of a report type suggests the type has not
been modified since the report was created.
Figure 27.10: Report types can be seen in the Element Info view of a report.
• Two Trim Sequences workflow elements, named "Trim by Quality" and "Trim by Ambiguous",
to reflect the type of trimming performed.
Figure 27.11: Clicking on the info icon at the top right corner of the input selection wizard opens a
window showing a list of tools that produce supported report types. Text entered in the field at the
top limits the list to just tools with names containing the search term.
Figure 27.12: Enter the report type to assign in the "Report type" field.
• Two Modify Report Type workflow elements, named "Modify Report Type to Trim by Quality"
and "Modify Report Type to Trim by Ambiguous", to reflect which reports they modify and
the types they set.
• One Combine Reports workflow element, which uses the two trim reports with modified
types.
Figure 27.13: An example workflow running two trimming jobs. The name of each trim element is
different but the underlying tool is the same, so the reports generated have the same type. The
report types are then modified, and reports with the modified type are used as input to the next
step.
The content of the summary reports can only be defined for the default report types and
applies to all reports, even those with a modified type, that originally had that report type
(figure 27.14).
Figure 27.14: Defining the contents for trimming applies to all reports produced by the trimming
tool, regardless of their report type.
• This tool is recommended when updating information for many sequences. However,
attributes can also be updated individually, either directly in the Table view of the Sequence
List (see section 14.1.3), or by opening the sequence from the Sequence List and editing
attributes in its Element Info view. Opening a sequence can be done from the Sequence List
view (right-click on the sequence name and choose "Open Sequence") or from the Table
view (right-click in the row for that sequence and choose the option "Open This Sequence").
• Attributes relating to characteristics of the sequence itself, such as its length or the start
of the sequence, cannot be updated using this tool, nor by directly editing the Sequence
List.
In the Settings wizard step, the file containing attribute information is specified, along with details
about how to handle that information (figure 27.16).
Figure 27.16: Information in the attribute file will be matched with the relevant sequence based
on contents of the Name column in the file and in the Sequence List. Five columns containing
relevant attribute information have been selected. The option to overwrite existing information has
been left unchecked.
• Attribute file Select an Excel file (.xls/.xlsx), a comma separated text file (.csv), or a
tab separated text file (.tsv) containing attribute information. Column names are used
as attribute names, so a header row is required. One column in the file must contain
information that can be matched with information already present in the Sequence List (see
"Column to match on", below).
• Column to match on Specify the column in the attribute file to use to match each row
with the relevant sequence(s) in the Sequence List. When a value in this column matches
a value in the column of the same name in the Sequence List, information from that row
in the file is added to the attribute information for that sequence. Only information from
specified columns will be added (see "Include columns", below).
When matching based on sequence names, the column in the file containing the names
must be called Name.
• Include columns Select the columns in the file containing the information to be updated or
added to the Sequence List as well as the column specified in the "Column to match on"
field.
When the name of a column does not match an existing attribute name in the Sequence List,
a new attribute will be added.
• Overwrite existing information When this option is checked, existing sequence attribute
values will be overwritten by values for the corresponding attributes in the attribute file.
When no corresponding value is present in the attributes file, no change is made to the
value in the Sequence List.
When left unchecked, existing attribute values in the Sequence List are not overwritten with
new information from the file.
• Download taxonomy Check this box to download a 7-step taxonomy from the NCBI into an
attribute called "Taxonomy". To use this option, there must be a column in the attributes
file called TaxID containing valid taxonomic identifiers. See the "Column headings and
value validation" section below for further details.
The "Taxonomy" attribute will be listed in the Preview wizard step, alongside the columns
selected for inclusion.
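The matching and overwrite behavior described above can be sketched as follows. This is a simplified illustration and not the Workbench implementation. Sequences and file rows are represented as dictionaries, empty strings are treated as "no value", and validation and taxonomy download are left out. The function name is hypothetical.

```python
def update_attributes(sequences, file_rows, match_on, include, overwrite=False):
    """Update sequence attributes (dicts) in place from rows of an attribute file."""
    rows_by_key = {row[match_on]: row for row in file_rows}
    for seq in sequences:
        row = rows_by_key.get(seq.get(match_on))
        if row is None:
            continue  # no row in the file matches this sequence
        for column in include:
            if column == match_on:
                continue
            new_value = row.get(column, "")
            if new_value == "":
                continue  # no value in the file: existing value is left unchanged
            if seq.get(column) and not overwrite:
                continue  # existing value is kept unless overwriting is enabled
            seq[column] = new_value  # adds a new attribute if it did not exist
    return sequences


seqs = [{"Name": "seq1", "Description": "old"}, {"Name": "seq2"}]
rows = [
    {"Name": "seq1", "Description": "new", "Organism": "E. coli"},
    {"Name": "seq2", "Description": "", "Organism": "Yeast"},
]
print(update_attributes(seqs, rows, "Name", ["Name", "Description", "Organism"]))
```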
The results of the choices made in the Settings step are reflected in the Preview wizard step
(figure 27.17). In the upper pane is a list of the attribute types to be updated or added,
as well as the attribute to be used to match sequences with the relevant information. How
particular columns will be handled is indicated in the "Content handling" column, including
whether validation will be applied. The columns subject to validation checks are described later
in this section.
Shown in the lower pane is a small subset of the incoming information from the attribute file,
based on the choices made in the Settings wizard step. Click on the "Previous" button to go
back to that step if anything needs to be adjusted.
Figure 27.17: The Preview wizard step shows information about how columns from the attribute
file will be handled, and whether any problems were detected. Where validation checks are carried
out, if any had failed, a yellow exclamation mark in the bottom pane would be shown for that
column. Here, all entries pass. The "Other" column is not subject to validation checks. Only one
sequence in the list is being updated in this example.
• TaxID When valid taxonomic identifiers are found in a column called TaxID, and the
Download taxonomy checkbox was checked in the Settings wizard step, then a 7-step
taxonomy is downloaded from the NCBI.
Examples of valid identifiers for the TaxID attribute are those found in /db_xref="taxon
fields in GenBank entries. For example, for /db_xref="taxon:5833", the expected value
in the TaxID column would be 5833.
If a given sequence has a value already set for the Taxonomy attribute, then that existing
value remains in place unless the "Overwrite existing information" box was checked in the
Settings wizard step.
• Gene ID The following identifiers in a Gene ID column are added as attribute values and
hyperlinked to the relevant online database:
Any other values in a Gene ID column are added as attributes to the relevant sequences,
but are not hyperlinked to an online data resource. Note that this is different to how other
non-validated attribute values are handled, as described below.
Multiple identifiers in a given cell, separated by commas, will be added as multiple Gene
ID attributes for the relevant sequence. If any one of those identifiers is not recognized as
one of the above types, then none will be hyperlinked.
Other columns where contents are validated are those with the headings listed below. If a value
in such a column cannot be validated, it is not added nor used to update attributes.
If you wish to add information of this type but do not want this level of validation applied, use a
heading other than the ones listed below.
• EC numbers EC identifiers
• Because these attributes are tied to the Location, they will not appear until the updated
Sequence List has been saved.
• The updated Sequence List must be saved to the same File Location as the input for these
attributes and their values to appear.
• If this tool is run on an unsaved Sequence List, or using inputs from more than one File
Location at the same time, Location-specific attributes will not be updated. Information in
the preview pane reflects this.
Figure 27.18: Sequence lists can be split into a set number of groups, or into lists containing
particular numbers of sequences, or split based on attribute values.
• Split into N lists In the "Number of lists to create" box, enter the number of lists to split
the input into.
• Create lists with N sequences each In the "Number of sequences per list" box, enter the
relevant number. The final sequence list in the set created may contain fewer than this
number.
• Split based on attribute values Specify the attribute to split upon from the drop-down list.
Columns in the table view of a sequence list equate to the attributes that the list can be
split upon.
If no information is entered into the "Attribute values" field, a sequence list is created
for each unique value of the specified attribute. If values are provided, a sequence list
is created for each of these where at least one sequence has that attribute value. For
example, if 3 values are specified, and sequences were found with attributes matching each
of these values, 3 sequence lists would be created. If no sequences were found containing
1 of those attribute values, then only 2 sequence lists would be created. Check the "Collect
sequences without matches" box to additionally produce a sequence list containing the
sequences where no match to a specified value was identified.
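The three splitting options can be sketched as shown below. This is an illustration only. It assumes that sequences are kept in their original order and distributed as evenly as possible when splitting into N lists, and that a specified attribute value matches when it is contained in the attribute text, as suggested by figure 27.19. Function names are hypothetical.

```python
def split_into_n_lists(seqs, n):
    """Split into n lists of near-equal size, preserving order."""
    size, extra = divmod(len(seqs), n)
    lists, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        lists.append(seqs[start:end])
        start = end
    return lists


def split_by_size(seqs, per_list):
    """Lists of per_list sequences each; the final list may be shorter."""
    return [seqs[i:i + per_list] for i in range(0, len(seqs), per_list)]


def split_by_attribute(seqs, attribute, values=None):
    """Return (groups, unmatched). With no values, group on each unique value."""
    groups, unmatched = {}, []
    for seq in seqs:
        text = seq.get(attribute, "")
        if values is None:
            groups.setdefault(text, []).append(seq)
            continue
        # Assumption: a value matches if it occurs within the attribute text.
        match = next((v for v in values if v in text), None)
        if match is None:
            unmatched.append(seq)  # used for "Collect sequences without matches"
        else:
            groups.setdefault(match, []).append(seq)
    return groups, unmatched


print(split_by_size(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```

Only values that match at least one sequence produce a group here, which corresponds to a sequence list only being created where at least one sequence has that attribute value.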
Figure 27.19: With the settings shown here, 3 sequence lists were created. These lists are open
in the background tabs shown. One contains sequences with descriptions that include the term
"Putative", one contains sequences with descriptions that include the term "Uncharacterized", and
one contains sequences containing neither term in the description.
Figure 27.21: Right-click for options to add the contents of folders as inputs. Here, the "Add folder
contents (recursively)" option was selected. If "Add folder contents" had been selected, only the
elements seqlist1 and seqlist2 would have been added to the Selected elements list on the right.
selection, then checking the Batch box provides the opportunity to do that. When checked, the
next wizard step shows the batch overview, where elements can be explicitly included or excluded
from those to be renamed, based on text patterns. This step is described in more detail in
section 11.3.
Checking the Batch checkbox for this tool also has the following effect when a folder is selected
as input:
• With the Batch option checked, the top level contents of that folder will be renamed.
• With the Batch option unchecked, the folder itself will be renamed.
• Renaming elements cannot be undone. To alter the names further, the elements must be
renamed again.
• The renaming action is recorded in the History for the element, but the "Originates
from" entry lists the changed element name, rather than the original element name.
Renaming options
This wizard step presents various options for the renaming action (figure 27.22). The Rename
Elements tool is used for illustration in this section, but the options are the same for the Rename
Sequences in Lists tool.
Figure 27.22: Text can be added, removed or replaced in the existing names.
• Add text to name Select this option to add text at the beginning or the end of the existing
name.
You can add text directly to these fields, and you can also include placeholders to indicate
certain types of information should be added. Multiple placeholders can be used, in
combination with other text if desired (figure 27.23). The available placeholders are:
{name} The current name of the element. Usually used when defining a new naming
pattern for replacing the full name of elements.
{shortname} Truncates the original name to 10 characters. Usually used when
replacing the full names of elements.
{Parent folder} The name of the folder containing the element.
{today} Today's date in the form YYYY-MM-DD
{enumeration} Adds a number to the name. This is intended for use when multiple
elements are selected as input. Each is assigned a number, starting with the number
1, (added as 0000001) for the first element that was selected, 2 (added as 0000002)
for the second element selected, and so on.
CHAPTER 27. UTILITY TOOLS 665
Click in a field and use Shift + F1 (Shift + Fn + F1 on Mac) to show the list of available
placeholders, as shown in figure 27.23. Click on a placeholder in that list to have it entered
into the field.
Figure 27.23: Press Shift + F1 (Shift + Fn + F1 on Mac) to reveal a drop-down list of placeholders
that can be used. Here, today's date and a hyphen would be prepended, and a hyphen and
an ascending numeric value appended, to the existing names.
• Shorten name Select this option to shorten a name by removing a specified number of
characters from the start and/or end of the name.
• Replace part of name Select this option to specify text or regular expressions to define
parts of the element names to be replaced. By default, the text entered in the fields
is interpreted literally. Check the "Interpret 'Replace' as regular expression" option to
indicate that the terms provided in the "Replace" field should be treated as regular
expressions. Information on regular expressions can be found at https://docs.
oracle.com/javase/tutorial/essential/regex/.
By clicking in either the "Replace" or "with" field and pressing Shift + F1 (Shift + Fn + F1
on Mac), a drop-down list of renaming possibilities is presented. The options listed for
the Replace field are some commonly used regular expressions. Other standard regular
expressions are also admissible in this field. The placeholders described above for adding
text to names are available for use in the "with" field. Note: We recommend caution when
using these placeholders in combination with regular expressions in the Replace field.
Please run a small test to ensure it works as you intend.
• Replace full name Select this option to replace the full element name. Text and placeholders
can be used in this field. The placeholders described above for adding text to names are
available for use. Use Shift + F1 (Shift + Fn + F1 on Mac) to see a list.
• Replacing part of an element's name with today's date and an underscore. Details are
shown in figure 27.24.
Figure 27.24: Elements with names Seqlist1 and Seqlist2 each start with a capital letter, followed
by 6 small letters. Using the settings shown, their names are updated to be the date the renaming
was done, followed by a hyphen, and the remaining parts of the original name, here, the integer at
the end of each name.
• Rename using the first 4 non-whitespace characters from names that start with 2 characters,
then have a space, then have multiple characters following, such as 1N R1_0001.
Check the box beside "Interpret 'Replace' and 'with' as Java regular expressions".
Enter ([\w]{2})\s([\w]{2}).* into the "Replace" field.
Enter $1$2 into the "with" field.
• Rename using just the last 4 characters of the name.
Check the box beside "Interpret 'Replace' and 'with' as Java regular expressions".
Enter (.*)(.{4}$) into the "Replace" field.
Enter $2 into the "with" field.
• Replace a set pattern of text with the name of the parent folder. Here, we start with the
name p140101034_1R_AMR and replace the first letter and 9 numbers with the parent
folder name.
Check the box beside "Interpret 'Replace' and 'with' as Java regular expressions".
Enter ([a-z]\d{9})(.*) into the "Replace" field.
Enter {parentfolder}$2 into the "with" field.
• Rename using just the text between the first and second underscores in 1234_sample-code_5678.
Check the box beside "Interpret 'Replace' and 'with' as Java regular expressions".
Enter (^[^_]+)_([^_]+)_(.*) into the "Replace" field.
Enter $2 into the "with" field.
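The regular expression examples above can be tried out before renaming real data. The following is a minimal sketch in Python, not the Workbench's own implementation: it assumes that Python's re module behaves like Java regular expressions for these particular patterns, it writes group references as \1 and \2 where the "with" field uses $1 and $2, and the function name replace_part_of_name and the folder name Run42 are hypothetical.

```python
import re


def replace_part_of_name(name, pattern, replacement):
    # Replace the first match of `pattern` in `name`.
    # Group references are written \1, \2 here ($1, $2 in the "with" field).
    return re.sub(pattern, replacement, name, count=1)


# First 4 non-whitespace characters of a name such as "1N R1_0001".
first_four = replace_part_of_name("1N R1_0001", r"([\w]{2})\s([\w]{2}).*", r"\1\2")

# Last 4 characters of a name.
last_four = replace_part_of_name("Seqlist1", r"(.*)(.{4}$)", r"\2")

# Leading letter and 9 digits replaced by a parent folder name.
parent_folder = "Run42"  # hypothetical folder name, for illustration only
with_parent = replace_part_of_name(
    "p140101034_1R_AMR", r"([a-z]\d{9})(.*)", parent_folder + r"\2"
)

# Text between the first and second underscores.
between = replace_part_of_name(
    "1234_sample-code_5678", r"(^[^_]+)_([^_]+)_(.*)", r"\2"
)

print(first_four, last_four, with_parent, between)
```

Running a check like this on a few representative names is one way to follow the recommendation above to run a small test before renaming many elements.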
Figure 27.25: When the sequences in more than one list should be renamed, check the Batch
checkbox.
The text "Renamed" is added within parentheses to the name of sequence lists output by this
tool. E.g. with an input called "seqlist2", the sequence list containing the renamed sequences
will be called "seqlist2 (Renamed)".
Renaming options
This wizard step presents various options for the renaming action (figure 27.26). The Rename
Elements tool is used for illustration in this section, but the options are the same for the Rename
Sequences in Lists tool.
• Add text to name Select this option to add text at the beginning or the end of the existing
name.
Figure 27.26: Text can be added, removed or replaced in the existing names.
You can add text directly to these fields, and you can also include placeholders to indicate
certain types of information should be added. Multiple placeholders can be used, in
combination with other text if desired (figure 27.27). The available placeholders are:
{name} The current name of the element. Usually used when defining a new naming
pattern for replacing the full name of elements.
{shortname} Truncates the original name to 10 characters. Usually used when
replacing the full names of elements.
{Parent folder} The name of the folder containing the element.
{today} Today's date in the form YYYY-MM-DD
{enumeration} Adds a number to the name. This is intended for use when multiple
elements are selected as input. Each is assigned a number, starting with the number
1, (added as 0000001) for the first element that was selected, 2 (added as 0000002)
for the second element selected, and so on.
Click in a field and use Shift + F1 (Shift + Fn + F1 on Mac) to show the list of available
placeholders, as shown in figure 27.27. Click on a placeholder in that list to have it entered
into the field.
• Shorten name Select this option to shorten a name by removing a specified number of
characters from the start and/or end of the name.
• Replace part of name Select this option to specify text or regular expressions to define
parts of the element names to be replaced. By default, the text entered in the fields
is interpreted literally. Check the "Interpret 'Replace' as regular expression" option to
indicate that the terms provided in the "Replace" field should be treated as regular
expressions. Information on regular expressions can be found at https://docs.
oracle.com/javase/tutorial/essential/regex/.
By clicking in either the "Replace" or "with" field and pressing Shift + F1 (Shift + Fn + F1
on Mac), a drop-down list of renaming possibilities is presented. The options listed for
the Replace field are some commonly used regular expressions. Other standard regular
expressions are also admissible in this field. The placeholders described above for adding
text to names are available for use in the "with" field. Note: We recommend caution when
using these placeholders in combination with regular expressions in the Replace field.
Please run a small test to ensure it works as you intend.
Figure 27.27: Press Shift + F1 (Shift + Fn + F1 on Mac) to reveal a drop-down list of placeholders
that can be used. Here, today's date and a hyphen would be prepended, and a hyphen and
an ascending numeric value appended, to the existing names.
• Replace full name Select this option to replace the full element name. Text and placeholders
can be used in this field. The placeholders described above for adding text to names are
available for use. Use Shift + F1 (Shift + Fn + F1 on Mac) to see a list.
• Replacing part of an element's name with today's date and an underscore. Details are
shown in figure 27.28.
• Rename using the first 4 non-whitespace characters from names that start with 2 characters,
then have a space, then have multiple characters following, such as 1N R1_0001.
Check the box beside "Interpret 'Replace' and 'with' as Java regular expressions".
Enter ([\w]{2})\s([\w]{2}).* into the "Replace" field.
Enter $1$2 into the "with" field.
Check the box beside "Interpret 'Replace' and 'with' as Java regular expressions".
Figure 27.28: Elements with names Seqlist1 and Seqlist2 each start with a capital letter, followed
by 6 small letters. Using the settings shown, their names are updated to be the date the renaming
was done, followed by a hyphen, and the remaining parts of the original name, here, the integer at
the end of each name.
• Replace a set pattern of text with the name of the parent folder. Here, we start with the
name p140101034_1R_AMR and replace the first letter and 9 numbers with the parent
folder name.
Check the box beside "Interpret 'Replace' and 'with' as Java regular expressions".
Enter ([a-z]\d{9})(.*) into the "Replace" field.
Enter {parentfolder}$2 into the "with" field.
• Rename using just the text between the first and second underscores in 1234_sample-code_5678.
Check the box beside "Interpret 'Replace' and 'with' as Java regular expressions".
Enter (^[^_]+)_([^_]+)_(.*) into the "Replace" field.
Enter $2 into the "with" field.
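To make the effect of the placeholders concrete, the sketch below expands them in a naming pattern. It is an illustration under stated assumptions, not the Workbench's code: the function expand_placeholders and its parameters are hypothetical, {shortname} is assumed to keep the first 10 characters, {enumeration} is assumed to be zero-padded to 7 digits as in the description above, and both spellings of the parent folder placeholder that appear in this section are accepted.

```python
import re
from datetime import date


def expand_placeholders(template, name, parent_folder, index, today=None):
    # `index` is the 1-based position of the element among the selected inputs.
    today = today or date.today()
    values = {
        "{name}": name,
        "{shortname}": name[:10],          # assumed: first 10 characters
        "{Parent folder}": parent_folder,  # spelling used in the placeholder list
        "{parentfolder}": parent_folder,   # spelling used in the regex example
        "{today}": today.isoformat(),      # YYYY-MM-DD
        "{enumeration}": f"{index:07d}",   # 1 -> 0000001
    }
    # Single pass over the template; unknown placeholders are left unchanged.
    return re.sub(
        r"\{[^{}]+\}", lambda m: values.get(m.group(0), m.group(0)), template
    )


# Today's date and a hyphen prepended, a hyphen and a number appended.
new_name = expand_placeholders(
    "{today}-{name}-{enumeration}",
    name="Seqlist1",
    parent_folder="Run42",  # hypothetical folder name
    index=1,
    today=date(2025, 1, 31),
)
print(new_name)
```

Fixing the date in the call, as done here, keeps the example reproducible; omitting it uses the current date.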
Part IV
Appendix
Appendix A
Graph preferences
This section explains the view settings of graphs. The Graph preferences group at the top of the
Side Panel includes the following settings:
• Lock axes This will always show the axes, even when the plot is zoomed to a detailed
level.
• Tick type Determine whether tick lines should be shown outside or inside the frame.
• Tick lines at Choosing Major ticks will show a grid behind the graph.
• Horizontal axis range Sets the range of the horizontal axis (x axis). Enter a value in Min
and Max, and press Enter. This will update the view. If you wait a few seconds without
pressing Enter, the view will also be updated.
• Vertical axis range Sets the range of the vertical axis (y axis). Enter a value in Min and
Max, and press Enter. This will update the view. If you wait a few seconds without pressing
Enter, the view will also be updated.
• X-axis at zero. This will draw the x axis at y = 0. Note that the axis range will not be
changed.
• Y-axis at zero. This will draw the y axis at x = 0. Note that the axis range will not be
changed.
• Show as histogram. For some data series it is possible to see the graph as a histogram
rather than a line plot.
The representation of the data is configured in the bottom area, e.g. line widths, dot types,
colors, etc. For graphs of multiple data series, the series to apply the settings to can be selected
from a drop down list.
• Dot type Can be None, Cross, Plus, Square, Diamond, Circle, Triangle, Reverse triangle, or
Dot.
The graph and axes titles can be edited simply by clicking with the mouse. These changes will be
saved when you Save the graph, whereas the changes in the Side Panel need to be saved
explicitly (see section 4.6).
Appendix B
Most proteolytic enzymes cleave at distinct patterns. Below is a compiled list of proteolytic
enzymes used in CLC Main Workbench.
APPENDIX B. PROTEOLYTIC CLEAVAGE ENZYMES 675
CLC Main Workbench uses enzymes from the REBASE restriction enzyme database at
http://rebase.neb.com. If you wish to add enzymes to this list, you can do this manually using
the procedure described here.
Note! Please be aware that this process needs to be handled carefully, otherwise you may
have to re-install the Workbench to get it to work.
First, download the following file:
https://resources.qiagenbioinformatics.com/wbsettings/link_emboss_e_custom.
In the Workbench installation folder under settings,
create a folder named rebase and place the extracted link_emboss_e_custom file here.
Note that on macOS, the downloaded file "link_emboss_e_custom" will have a ".txt" extension in
its filename and metadata, which needs to be removed. Right click the file name, choose "Get
info" and remove ".txt" from the "Name & extension" field.
Open the file in a text editor. The top of the file contains information about the format, and at the
bottom there are two example enzymes that you should replace with your own.
Please note that the CLC Workbenches only support the addition of 2-cutter enzymes. Further
details about how to format your entries accordingly are given within the file mentioned above.
After adding the above file, or making changes to it, you must restart the Workbench for the
changes to take effect.
Appendix D
The CLC Main Workbench comes with a pre-defined list of Gateway recombination sites. These
sites and the recombination logic can be modified by downloading and editing a properties file.
Note that this is a technical procedure only needed if the built-in functionality is not sufficient for
your needs.
The properties file can be downloaded from
https://resources.qiagenbioinformatics.com/wbsettings/gatewaycloning.zip.
Extract the file included in the zip archive and save
it in the settings folder of the Workbench installation folder. The file you download contains
the standard configuration. You should thus update the file to match your specific needs. See
the comments in the file for more information.
The name of the properties file you download is gatewaycloning.1.properties. You
can add several files with different configurations by giving them a different number, e.g.
gatewaycloning.2.properties and so forth. When using the Gateway tools in the Workbench,
you will be asked which configuration you want to use (see figure D.1).
Appendix E
Appendix F
Appendix G
APPENDIX G. FORMATS FOR IMPORT AND EXPORT 681
• When importing trace data, the called bases in the file are imported and the chromatogram
information associated with the called bases is imported. If the base calls within the file
have already been trimmed, the part of the chromatogram not associated with base calls
will not be imported.
• The Trim Sequences tool, described in section 21.2, adds annotations to trimmed regions.
When exporting to fasta format, there is an option to remove sequence ends covered by
Trim annotations.
The CLC Main Workbench supports analysis of one-color expression arrays. These may be
imported from GEO SOFT sample or series file formats, or for Affymetrix arrays, tab-delimited pivot
or metrics files, or from Illumina expression files. Expression array data from other platforms may
be imported from tab, semi-colon or comma separated files containing the expression feature IDs
and levels in a tabular format (see section Generic expression and annotation data file formats).
The CLC Main Workbench assumes that expression values are given at the gene level, thus probe-level
analysis of Affymetrix GeneChips and import of Affymetrix CEL and CDF files is currently not
supported. However, the CLC Main Workbench allows import of txt files exported from R containing
processed Affymetrix CEL-file data (see section Affymetrix GeneChip).
Affymetrix NetAffx annotation files for expression GeneChips in csv format and Illumina annotation
files can also be imported.
Also, you may import your own annotation data in tabular format (see section Generic
expression and annotation data file formats).
Below are descriptions of the microarray data formats that are supported by CLC Main
Workbench. Note that for some platforms, both expression data and annotation data are supported.
^SAMPLE = GSM21610
!sample_table_begin
...
!sample_table_end
APPENDIX H. GENE EXPRESSION ANNOTATION FILES AND MICROARRAY DATA FORMATS 685
Figure H.1: Selecting Samples, SOFT and Data before clicking go will give you the format supported
by the CLC Main Workbench.
The first line should start with ^SAMPLE = followed by the sample name. The file must also contain
the line !sample_table_begin and the line !sample_table_end. The lines between !sample_table_begin
and !sample_table_end hold the column contents of the sample.
Note that the GEO sample importer also works for concatenated GEO sample files, allowing
multiple samples to be imported in one go. Download a sample file containing concatenated
sample files here:
https://resources.qiagenbioinformatics.com/madata/GEOSampleFilesConcatenated.txt
Below you can find examples of the formatting of the GEO formats.
^SAMPLE = GSM21610
!sample_table_begin
ID_REF VALUE
id1 105.8
id2 32
id3 50.4
id4 57.8
id5 2914.1
!sample_table_end
^SAMPLE = GSM21610
!sample_table_begin
ID_REF VALUE ABS_CALL
id1 105.8 M
id2 32 A
id3 50.4 A
id4 57.8 A
id5 2914.1 P
!sample_table_end
^SAMPLE = GSM21610
!sample_table_begin
ID_REF VALUE ABS_CALL DETECTION P-VALUE
id1 105.8 M 0.00227496
id2 32 A 0.354441
id3 50.4 A 0.904352
id4 57.8 A 0.937071
id5 2914.1 P 6.02111e-05
!sample_table_end
GEO sample file: using absent/present call and p-value columns for sequence information
The CLC Main Workbench assumes that if there is a third column in the GEO sample file then it
contains present/absent calls and that if there is a fourth column then it contains p-values for
these calls. This means that the contents of the third column are assumed to be text and those
of the fourth column to be numbers. As long as these two basic requirements are met, the sample
should be recognized and interpreted correctly.
You can thus use these two columns to carry additional information on your probes. The
absent/present column can be used to carry additional information, e.g. sequence tags, as
shown below:
^SAMPLE = GSM21610
!sample_table_begin
ID_REF VALUE ABS_CALL
id1 105.8 AAA
id2 32 AAC
^SAMPLE = GSM21610
!sample_table_begin
ID_REF VALUE ABS_CALL DETECTION P-VALUE
probe1 755.07 seq1 1452
probe2 587.88 seq1 497
probe3 716.29 seq1 1447
probe4 1287.18 seq2 1899
!sample_table_end
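The column assumptions described above for GEO sample files can be sketched as a small reader. This is a minimal illustration and not the Workbench importer: the function name parse_geo_samples is hypothetical, columns are assumed to be separated by whitespace, and the first line after !sample_table_begin is assumed to be the header row. Data rows are read by position (ID, value, optional text column, optional numeric column), which also handles concatenated sample files.

```python
def parse_geo_samples(text):
    # Returns {sample name: [row dict, ...]} for one or more concatenated samples.
    samples = {}
    current = None
    in_table = False
    header_seen = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("^SAMPLE"):
            current = line.split("=", 1)[1].strip()
            samples[current] = []
            in_table = False
        elif line == "!sample_table_begin":
            in_table = True
            header_seen = False
        elif line == "!sample_table_end":
            in_table = False
        elif in_table and current is not None:
            if not header_seen:
                header_seen = True  # assumed header row, e.g. ID_REF VALUE ...
                continue
            fields = line.split()
            row = {"id": fields[0], "value": float(fields[1])}
            if len(fields) > 2:
                row["call"] = fields[2]            # third column: text
            if len(fields) > 3:
                row["p_value"] = float(fields[3])  # fourth column: number
            samples[current].append(row)
    return samples


example = """^SAMPLE = GSM21610
!sample_table_begin
ID_REF VALUE ABS_CALL DETECTION P-VALUE
probe1 755.07 seq1 1452
probe2 587.88 seq1 497
!sample_table_end
"""
parsed = parse_geo_samples(example)
print(parsed["GSM21610"][0])
```

Because the third column is kept as text and the fourth is converted to a number, a file that reuses these columns for sequence tags and other numeric values is read the same way as one holding calls and p-values.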
and export a txt file containing a table of estimated probe-level log-transformed expression values
in three lines of code:
The exported txt file (evals.txt) can be imported into the CLC Main Workbench using the Generic
expression data table format importer (see section Generic expression and annotation data file
formats; you can just 'drag-and-drop' it in). In R, you should have all the CEL files you wish to
process in your working directory and the file 'evals.txt' will be written to that directory.
If multiple probes are present for the same gene, further processing may be required to merge
them into a single gene-level expression.
All this information is imported into the CLC Main Workbench. The AVG_Signal is used as the
expression measure.
Download a small sample file here:
https://resources.qiagenbioinformatics.com/madata/IlluminaBeadChipCompact.txt
All this information is imported into the CLC Main Workbench. The AVG_Signal is used as the
expression measure.
Download a small sample file here:
https://resources.qiagenbioinformatics.com/madata/IlluminaBeadChipExtended.txt
Only the TargetID, Signal and Detection columns will be imported; the remaining columns will
be ignored. This means that the annotations are not imported. The Signal is used as the
expression measure.
Download a small example sample file here:
https://resources.qiagenbioinformatics.com/madata/IlluminaBeadStudioWithAnnotati
txt
able to import them into the CLC Main Workbench as a 'generic' expression or annotation data
file. There are a few simple requirements that need to be fulfilled to do this as described below.
1. the first non-empty line of the file contains text. All entries, except the first, will be used as
sample names
2. the following (non-empty) lines contain the same number of entries as the first non-empty
line. In each of these lines, the first entry should be a string (this will be used
as the feature ID) and the remaining entries should contain numbers (which will be used as
expression values, one per sample). Empty entries are not allowed, but NaN values are
allowed.
FeatureID;sample1;sample2;sample3
gene1;200;300;23
gene2;210;30;238
gene3;230;50;23
gene4;50;100;235
gene5;200;300;23
gene6;210;30;238
gene7;230;50;23
gene8;50;100;235
This will be imported as three samples with eight genes in each sample.
Download this example as a file here:
https://resources.qiagenbioinformatics.com/madata/CustomExpressionData.txt
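The two requirements for a generic expression data file can be checked before import with a short script. The following is a sketch under stated assumptions, not the Workbench importer: the function name parse_generic_expression is hypothetical, and the separator is assumed to be the first of tab, semi-colon or comma that occurs in the header line.

```python
def parse_generic_expression(text):
    # Returns (sample names, {feature ID: [expression values]}).
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("file contains no non-empty lines")
    # Assumed rule: use the first separator found in the header line.
    delimiter = next((d for d in ("\t", ";", ",") if d in lines[0]), None)
    if delimiter is None:
        raise ValueError("no tab, semi-colon or comma found in header line")
    header = [h.strip() for h in lines[0].split(delimiter)]
    sample_names = header[1:]  # all entries except the first
    features = {}
    for ln in lines[1:]:
        entries = [e.strip() for e in ln.split(delimiter)]
        if len(entries) != len(header):
            raise ValueError(f"wrong number of entries: {ln!r}")
        if any(e == "" for e in entries):
            raise ValueError(f"empty entry: {ln!r}")
        # float() accepts NaN, which is allowed as an expression value.
        features[entries[0]] = [float(e) for e in entries[1:]]
    return sample_names, features


example = """FeatureID;sample1;sample2;sample3
gene1;200;300;23
gene2;210;30;238
gene3;230;50;NaN
"""
names, values = parse_generic_expression(example)
print(names, len(values))
```

A file that passes this check has one header line naming the samples and one numeric value per sample for every feature ID.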
1. It has a line which can serve as a valid header line. In order to do this, the line should
have a number of headers where at least two are among the valid column headers in the
Column header column below.
2. It contains one of the PROBE_ID headers (that is: 'Probe Set ID', 'Feature ID', 'ProbeID' or
'Probe_Id').
The importer will import an annotation table with a column for each of the valid column headers
(those in the Column header column below). Columns with invalid headers will be ignored.
Note that some column headers are alternatives, so only one of the alternative column
headers should be used.
When adding annotations to an experiment, you can specify the column in your annotation file
containing the relevant identifiers. These identifiers are matched to the feature IDs already
present in your experiment. When a match is found, the annotation is added to that entry in the
experiment. In other words, at least one column in your annotation file must contain identifiers
matching the feature identifiers in the experiment, for those annotations to be applied.
A simple example of an annotation file is shown here:
To meet requirements imposed by special functionalities in the CLC Main Workbench, there are
a number of further restrictions on the contents in the entries of the columns:
Download sequence functionality In the experiment table, you can click a button to download
sequence. This uses the contents of the PUBLIC_ID column, so this column must be
present for the action to work and should contain the NCBI accession number.
Annotation tests The annotation tests can make use of several entries in a column as long
as a certain format is used. The tests assume that entries are separated by /// and
interpret all that appears before // as the actual entry and all that appears after // within
an entry as comments. Example:
The annotation tests will interpret this as three entries (0000001, 0000008, and 0003746)
with the corresponding comments.
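The separator convention used by the annotation tests can be illustrated with a short sketch. It is not the Workbench's code: the function name split_annotation_entries is hypothetical, and the input string is a made-up cell that reuses the three entry IDs mentioned above with placeholder comments.

```python
def split_annotation_entries(cell):
    # Split a cell into (entry, comment) pairs:
    # entries are separated by "///", comments follow "//" within an entry.
    pairs = []
    for chunk in cell.split("///"):
        chunk = chunk.strip()
        if not chunk:
            continue
        entry, _, comment = chunk.partition("//")
        pairs.append((entry.strip(), comment.strip()))
    return pairs


# Hypothetical cell contents; the comments are placeholders.
cell = "0000001 // comment1 /// 0000008 // comment2 /// 0003746 // comment3"
entries = split_annotation_entries(cell)
print(entries)
```

Splitting on /// first and on // second matters, since splitting on // first would also cut through the entry separators.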
Column header in imported file (alternatives separated by commas) | Label in experiment table | Description (tool tip)
Species Scientific Name, Species Name, Species | Species name | Scientific species name
GeneChip Array | Gene chip array | Gene Chip Array name
Annotation Date | Annotation date | Date of annotation
Sequence Type | Sequence type | Type of sequence
Sequence Source | Sequence source | Source from which sequence was obtained
Transcript ID(Array Design), Transcript | Transcript ID | Transcript identifier tag
You can edit the list of codon frequency tables used by CLC Main Workbench.
Note! Please be aware that this process needs to be handled carefully, otherwise you may
have to re-install the Workbench to get it to work.
In the Workbench installation folder under res, there is a folder named codonfreq. This
folder contains all the codon frequency tables organized into subfolders in a hierarchy. In order
to change the tables, you simply add, delete or rename folders and the files in the folders.
If you wish to add new tables, please use the existing ones as templates. In existing tables,
the "_number" at the end of the ".cftbl" file name is the number of CDSs that were used for
calculation, according to the https://www.kazusa.or.jp/codon/ site.
When creating a custom table, it is not necessary to fill in all fields as only the codon information
(e.g. 'GCG' in the example below) and the counts (e.g. 47869.00) are used when doing reverse
translation:
Name: Rattus norvegicus
GeneticCode: 1
Ala GCG 47869.00 6.86 0.10
Ala GCA 109203.00 15.64 0.23
....
In particular, the amino acid type is not used: in order to use an alternative genetic code, it must
be specified in the 'GeneticCode' line instead.
Restart the Workbench to have the changes take effect.
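To check a custom table before placing it in the codonfreq folder, the fields that matter can be read with a short script. This is a sketch under stated assumptions, not the Workbench's parser: the function name parse_codon_table is hypothetical, and the file is assumed to hold one record per line, with the codon in the second field and the count in the third, as in the example above.

```python
def parse_codon_table(text):
    # Returns (name, genetic code number, {codon: count}).
    name = None
    genetic_code = None
    counts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("GeneticCode:"):
            genetic_code = int(line.split(":", 1)[1])
        else:
            fields = line.split()
            # Only the codon and its count are read; the amino acid
            # label and any further columns are ignored.
            if len(fields) >= 3 and len(fields[1]) == 3:
                counts[fields[1].upper()] = float(fields[2])
    return name, genetic_code, counts


example = """Name: Rattus norvegicus
GeneticCode: 1
Ala GCG 47869.00 6.86 0.10
Ala GCA 109203.00 15.64 0.23
"""
table_name, code, codon_counts = parse_codon_table(example)
print(table_name, code, codon_counts)
```

The sketch mirrors the point made above: the amino acid column is not consulted, so an alternative genetic code has to be given on the GeneticCode line.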
694
Bibliography
[Allison et al., 2006] Allison, D., Cui, X., Page, G., and Sabripour, M. (2006). Microarray data
analysis: from disarray to consolidation and consensus. NATURE REVIEWS GENETICS, 7(1):55.
[Altschul et al., 1990] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J.
(1990). Basic local alignment search tool. J Mol Biol, 215(3):403--410.
[Andrade et al., 1998] Andrade, M. A., O'Donoghue, S. I., and Rost, B. (1998). Adaptation of
protein surfaces to subcellular location. J Mol Biol, 276(2):517--525.
[Ashburner et al., 2000] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry,
J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver,
L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and
Sherlock, G. (2000). Gene ontology: tool for the unification of biology. Nat Genet, 25(1):25--29.
[Bachmair et al., 1986] Bachmair, A., Finley, D., and Varshavsky, A. (1986). In vivo half-life of a
protein is a function of its amino-terminal residue. Science, 234(4773):179--186.
[Baggerly et al., 2003] Baggerly, K., Deng, L., Morris, J., and Aldaz, C. (2003). Differen-
tial expression in SAGE: accounting for normal between-library variation. Bioinformatics,
19(12):1477--1483.
[Bateman et al., 2004] Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones,
S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L., Studholme, D. J., Yeats, C.,
and Eddy, S. R. (2004). The Pfam protein families database. Nucleic Acids Res., 32(Database
issue):D138--D141.
[Benjamini and Hochberg, 1995] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false
discovery rate: a practical and powerful approach to multiple testing. JOURNAL-ROYAL
STATISTICAL SOCIETY SERIES B, 57:289--289.
[Berman et al., 2003] Berman, H., Henrick, K., and Nakamura, H. (2003). Announcing the
worldwide protein data bank. Nat Struct Biol, 10(12):980.
[Bishop and Friday, 1985] Bishop, M. J. and Friday, A. E. (1985). Evolutionary trees from nucleic
acid and protein sequences. Proceeding of the Royal Society of London, B 226:271--302.
[Blaisdell, 1989] Blaisdell, B. E. (1989). Average values of a dissimilarity measure not requir-
ing sequence alignment are twice the averages of conventional mismatch counts requiring
sequence alignment for a computer-generated model system. J Mol Evol, 29(6):538--47.
[Bolstad et al., 2003] Bolstad, B., Irizarry, R., Astrand, M., and Speed, T. (2003). A comparison
of normalization methods for high density oligonucleotide array data based on variance and
bias. Bioinformatics, 19(2):185--193.
695
BIBLIOGRAPHY 696
[Bommarito et al., 2000] Bommarito, S., Peyret, N., and SantaLucia, J. (2000). Thermodynamic
parameters for DNA sequences with dangling ends. Nucleic Acids Res, 28(9):1929--1934.
[Chen et al., 2004] Chen, G., Znosko, B. M., Jiao, X., and Turner, D. H. (2004). Factors affecting
thermodynamic stabilities of RNA 3 x 3 internal loops. Biochemistry, 43(40):12865--12876.
[Clote et al., 2005] Clote, P., Ferré, F., Kranakis, E., and Krizanc, D. (2005). Structural RNA has
lower folding energy than random RNA of the same dinucleotide frequency. RNA, 11(5):578--
591.
[Cornette et al., 1987] Cornette, J. L., Cease, K. B., Margalit, H., Spouge, J. L., Berzofsky, J. A.,
and DeLisi, C. (1987). Hydrophobicity scales and computational techniques for detecting
amphipathic structures in proteins. J Mol Biol, 195(3):659--685.
[Costa, 2007] Costa, F. F. (2007). Non-coding RNAs: lost in translation? Gene, 386(1-2):1--10.
[Crooks et al., 2004] Crooks, G. E., Hon, G., Chandonia, J.-M., and Brenner, S. E. (2004).
WebLogo: a sequence logo generator. Genome Res, 14(6):1188--1190.
[Dayhoff and Schwartz, 1978] Dayhoff, M. O. and Schwartz, R. M. (1978). Atlas of Protein
Sequence and Structure, volume 3 of 5 suppl., pages 353--358. Nat. Biomed. Res. Found.,
Washington D.C.
[Dayhoff et al., 1978] Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978). A model of
evolutionary change in protein. Atlas of Protein Sequence and Structure, 5(3):345--352.
[Dempster et al., 1977] Dempster, A., Laird, N., Rubin, D., et al. (1977). Maximum likelihood
from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1--38.
[Dudoit et al., 2003] Dudoit, S., Shaffer, J., and Boldrick, J. (2003). Multiple Hypothesis Testing
in Microarray Experiments. STATISTICAL SCIENCE, 18(1):71--103.
[Eddy, 2004] Eddy, S. R. (2004). Where did the BLOSUM62 alignment score matrix come from?
Nat Biotechnol, 22(8):1035--1036.
[Edgar, 2004] Edgar, R. C. (2004). Muscle: a multiple sequence alignment method with reduced
time and space complexity. BMC Bioinformatics, 5:113.
[Efron, 1982] Efron, B. (1982). The jackknife, the bootstrap and other resampling plans, vol-
ume 38. SIAM.
[Eisen et al., 1998] Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). Cluster analysis
and display of genome-wide expression patterns. Proceedings of the National Academy of
Sciences, 95(25):14863--14868.
[Eisenberg et al., 1984] Eisenberg, D., Schwarz, E., Komaromy, M., and Wall, R. (1984). Analysis
of membrane and surface protein sequences with the hydrophobic moment plot. J Mol Biol,
179(1):125--142.
[Emini et al., 1985] Emini, E. A., Hughes, J. V., Perlow, D. S., and Boger, J. (1985). Induction of
hepatitis a virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol, 55(3):836--
839.
BIBLIOGRAPHY 697
[Engelman et al., 1986] Engelman, D. M., Steitz, T. A., and Goldman, A. (1986). Identifying
nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev
Biophys Biophys Chem, 15:321--353.
[Falcon and Gentleman, 2007] Falcon, S. and Gentleman, R. (2007). Using GOstats to test gene
lists for GO term association. Bioinformatics, 23(2):257.
[Felsenstein, 1981] Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum
likelihood approach. J Mol Evol, 17(6):368--376.
[Feng and Doolittle, 1987] Feng, D. F. and Doolittle, R. F. (1987). Progressive sequence align-
ment as a prerequisite to correct phylogenetic trees. J Mol Evol, 25(4):351--360.
[Forsberg et al., 2001] Forsberg, R., Oleksiewicz, M. B., Petersen, A. M., Hein, J., Bøtner, A., and
Storgaard, T. (2001). A molecular clock dates the common ancestor of European-type porcine
reproductive and respiratory syndrome virus at more than 10 years before the emergence of
disease. Virology, 289(2):174--179.
[Galperin and Koonin, 1998] Galperin, M. Y. and Koonin, E. V. (1998). Sources of systematic
error in functional annotation of genomes: domain rearrangement, non-orthologous gene
displacement and operon disruption. In Silico Biol, 1(1):55--67.
[Gentleman and Mullin, 1989] Gentleman, J. F. and Mullin, R. (1989). The distribution of the
frequency of occurrence of nucleotide subsequences, based on their overlap capability.
Biometrics, 45(1):35--52.
[Gill and von Hippel, 1989] Gill, S. C. and von Hippel, P. H. (1989). Calculation of protein
extinction coefficients from amino acid sequence data. Anal Biochem, 182(2):319--326.
[Gonda et al., 1989] Gonda, D. K., Bachmair, A., Wünning, I., Tobias, J. W., Lane, W. S.,
and Varshavsky, A. (1989). Universality and structure of the N-end rule. J Biol Chem,
264(28):16700--16712.
[Guindon and Gascuel, 2003] Guindon, S. and Gascuel, O. (2003). A Simple, Fast, and Accu-
rate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Systematic Biology,
52(5):696--704.
[Guo et al., 2006] Guo, L., Lobenhofer, E. K., Wang, C., Shippy, R., Harris, S. C., Zhang, L., Mei,
N., Chen, T., Herman, D., Goodsaid, F. M., Hurban, P., Phillips, K. L., Xu, J., Deng, X., Sun,
Y. A., Tong, W., Dragan, Y. P., and Shi, L. (2006). Rat toxicogenomic study reveals analytical
consistency across microarray platforms. Nat Biotechnol, 24(9):1162--1169.
[Han et al., 1999] Han, K., Kim, D., and Kim, H. (1999). A vector-based method for drawing RNA
secondary structure. Bioinformatics, 15(4):286--297.
[Hasegawa et al., 1985] Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating of the human-ape
splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution,
22(2):160--174.
[Henikoff and Henikoff, 1992] Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution
matrices from protein blocks. Proc Natl Acad Sci U S A, 89(22):10915--10919.
[Höhl et al., 2007] Höhl, M., Rigoutsos, I., and Ragan, M. A. (2007). Pattern-based phylogenetic
distance estimation and tree reconstruction. Evolutionary Bioinformatics, 2:0--0.
[Hopp and Woods, 1983] Hopp, T. P. and Woods, K. R. (1983). A computer program for predicting
protein antigenic determinants. Mol Immunol, 20(4):483--489.
[Ikai, 1980] Ikai, A. (1980). Thermostability and aliphatic index of globular proteins. J Biochem
(Tokyo), 88(6):1895--1898.
[Janin, 1979] Janin, J. (1979). Surface and inside volumes in globular proteins. Nature,
277(5696):491--492.
[Jones et al., 1992] Jones, D., Taylor, W., and Thornton, J. (1992). The rapid generation of
mutation data matrices from protein sequences. Computer Applications in the Biosciences
(CABIOS), 8:275--282.
[Jukes and Cantor, 1969] Jukes, T. and Cantor, C. (1969). Mammalian Protein Metabolism,
chapter Evolution of protein molecules, pages 21--32. New York: Academic Press.
[Kal et al., 1999] Kal, A. J., van Zonneveld, A. J., Benes, V., van den Berg, M., Koerkamp, M. G.,
Albermann, K., Strack, N., Ruijter, J. M., Richter, A., Dujon, B., Ansorge, W., and Tabak,
H. F. (1999). Dynamics of gene expression revealed by comparison of serial analysis of gene
expression transcript profiles from yeast grown on two different carbon sources. Mol Biol Cell,
10(6):1859--1872.
[Karplus and Schulz, 1985] Karplus, P. A. and Schulz, G. E. (1985). Prediction of chain flexibility
in proteins. Naturwissenschaften, 72:212--213.
[Kaufman and Rousseeuw, 1990] Kaufman, L. and Rousseeuw, P. (1990). Finding groups in
data: an introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics.
Applied Probability and Statistics, New York: Wiley.
[Kierzek et al., 1999] Kierzek, R., Burkard, M. E., and Turner, D. H. (1999). Thermodynamics of
single mismatches in RNA duplexes. Biochemistry, 38(43):14214--14223.
[Kimura, 1980] Kimura, M. (1980). A simple method for estimating evolutionary rates of base
substitutions through comparative studies of nucleotide sequences. J Mol Evol,
16(2):111--120.
[Knudsen and Miyamoto, 2001] Knudsen, B. and Miyamoto, M. M. (2001). A likelihood ratio
test for evolutionary rate shifts and functional divergence among proteins. Proc Natl Acad Sci
U S A, 98(25):14512--14517.
[Knudsen and Miyamoto, 2003] Knudsen, B. and Miyamoto, M. M. (2003). Sequence alignments
and pair hidden Markov models using evolutionary history. Journal of Molecular Biology,
333(2):453--460.
[Kyte and Doolittle, 1982] Kyte, J. and Doolittle, R. F. (1982). A simple method for displaying
the hydropathic character of a protein. J Mol Biol, 157(1):105--132.
[Leitner and Albert, 1999] Leitner, T. and Albert, J. (1999). The molecular clock of HIV-1 unveiled
through analysis of a known transmission history. Proc Natl Acad Sci U S A,
96(19):10752--10757.
[Lloyd, 1982] Lloyd, S. (1982). Least squares quantization in PCM. Information Theory, IEEE
Transactions on, 28(2):129--137.
[Longfellow et al., 1990] Longfellow, C. E., Kierzek, R., and Turner, D. H. (1990). Thermodynamic
and spectroscopic study of bulge loops in oligoribonucleotides. Biochemistry, 29(1):278--285.
[Lu et al., 2008] Lu, M., Dousis, A. D., and Ma, J. (2008). OPUS-Rota: A fast and accurate
method for side-chain modeling. Protein Science, 17(9):1576--1585.
[Maizel and Lenk, 1981] Maizel, J. V. and Lenk, R. P. (1981). Enhanced graphic matrix analysis
of nucleic acid and protein sequences. Proc Natl Acad Sci U S A, 78(12):7665--7669.
[Mathews et al., 2004] Mathews, D. H., Disney, M. D., Childs, J. L., Schroeder, S. J., Zuker,
M., and Turner, D. H. (2004). Incorporating chemical modification constraints into a dynamic
programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci U S A,
101(19):7287--7292.
[Mathews et al., 1999] Mathews, D. H., Sabina, J., Zuker, M., and Turner, D. H. (1999).
Expanded sequence dependence of thermodynamic parameters improves prediction of RNA
secondary structure. J Mol Biol, 288(5):911--940.
[Mathews and Turner, 2002] Mathews, D. H. and Turner, D. H. (2002). Experimentally derived
nearest-neighbor parameters for the stability of RNA three- and four-way multibranch loops.
Biochemistry, 41(3):869--880.
[Mathews and Turner, 2006] Mathews, D. H. and Turner, D. H. (2006). Prediction of RNA
secondary structure by free energy minimization. Curr Opin Struct Biol, 16(3):270--278.
[McCaskill, 1990] McCaskill, J. S. (1990). The equilibrium partition function and base pair
binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105--1119.
[McGinnis and Madden, 2004] McGinnis, S. and Madden, T. L. (2004). BLAST: at the core of
a powerful and diverse set of sequence analysis tools. Nucleic Acids Res, 32(Web Server
issue):W20--W25.
[Miao et al., 2011] Miao, Z., Cao, Y., and Jiang, T. (2011). RASP: rapid modeling of protein side
chain conformations. Bioinformatics, 27(22):3117--3122.
[Michener and Sokal, 1957] Michener, C. and Sokal, R. (1957). A quantitative approach to a
problem in classification. Evolution, 11:130--162.
[Mortazavi et al., 2008] Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold,
B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods,
5(7):621--628.
[Mukherjee and Zhang, 2009] Mukherjee, S. and Zhang, Y. (2009). MM-align: A quick algorithm
for aligning multiple-chain protein complex structures using iterative dynamic programming.
Nucleic Acids Res., 37.
[Pace et al., 1995] Pace, C. N., Vajdos, F., Fee, L., Grimsley, G., and Gray, T. (1995). How to
measure and predict the molar absorption coefficient of a protein. Protein Science,
4(11):2411--2423.
[Purvis, 1995] Purvis, A. (1995). A composite estimate of primate phylogeny. Philos Trans R Soc
Lond B Biol Sci, 348(1326):405--421.
[Rivas and Eddy, 2000] Rivas, E. and Eddy, S. R. (2000). Secondary structure alone is generally
not statistically significant for the detection of noncoding RNAs. Bioinformatics, 16(7):583--605.
[Rose et al., 1985] Rose, G. D., Geselowitz, A. R., Lesser, G. J., Lee, R. H., and Zehfus, M. H.
(1985). Hydrophobicity of amino acid residues in globular proteins. Science,
229(4716):834--838.
[Rost, 2001] Rost, B. (2001). Review: protein secondary structure prediction continues to rise.
J Struct Biol, 134(2-3):204--218.
[Saitou and Nei, 1987] Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new
method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406--425.
[Sankoff et al., 1983] Sankoff, D., Kruskal, J., Mainville, S., and Cedergren, R. (1983). Time
Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison,
chapter Fast algorithms to determine RNA secondary structures containing multiple loops,
pages 93--120. Addison-Wesley, Reading, Ma.
[SantaLucia, 1998] SantaLucia, J. (1998). A unified view of polymer, dumbbell, and
oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci U S A, 95(4):1460--1465.
[Schechter and Berger, 1967] Schechter, I. and Berger, A. (1967). On the size of the active site
in proteases. I. Papain. Biochem Biophys Res Commun, 27(2):157--162.
[Schechter and Berger, 1968] Schechter, I. and Berger, A. (1968). On the active site of
proteases. 3. Mapping the active site of papain; specific peptide inhibitors of papain. Biochem
Biophys Res Commun, 32(5):898--902.
[Schneider and Stephens, 1990] Schneider, T. D. and Stephens, R. M. (1990). Sequence logos:
a new way to display consensus sequences. Nucleic Acids Res, 18(20):6097--6100.
[Schroeder et al., 1999] Schroeder, S. J., Burkard, M. E., and Turner, D. H. (1999). The
energetics of small internal loops in RNA. Biopolymers, 52(4):157--167.
[Shapiro et al., 2007] Shapiro, B. A., Yingling, Y. G., Kasprzak, W., and Bindewald, E. (2007).
Bridging the gap in RNA structure prediction. Curr Opin Struct Biol, 17(2):157--165.
[Siepel and Haussler, 2004] Siepel, A. and Haussler, D. (2004). Combining phylogenetic and
hidden Markov models in biosequence analysis. J Comput Biol, 11(2-3):413--428.
[Smith and Waterman, 1981] Smith, T. F. and Waterman, M. S. (1981). Identification of common
molecular subsequences. J Mol Biol, 147(1):195--197.
[Sturges, 1926] Sturges, H. A. (1926). The choice of a class interval. Journal of the American
Statistical Association, 21:65--66.
[The Gene Ontology Consortium, 2019] The Gene Ontology Consortium (2019). Gene ontology
resource: 20 years and still going strong. Nucleic Acids Research, 47(D1):D330--D338.
[Tian et al., 2005] Tian, L., Greenberg, S., Kong, S., Altschuler, J., Kohane, I., and Park,
P. (2005). Discovering statistically significant pathways in expression profiling studies.
Proceedings of the National Academy of Sciences, 102(38):13544--13549.
[Tobias et al., 1991] Tobias, J. W., Shrader, T. E., Rocap, G., and Varshavsky, A. (1991). The
N-end rule in bacteria. Science, 254(5036):1374--1377.
[Tusher et al., 2001] Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of
microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A,
98(9):5116--5121.
[von Ahsen et al., 2001] von Ahsen, N., Wittwer, C. T., and Schütz, E. (2001). Oligonucleotide
melting temperatures under PCR conditions: nearest-neighbor corrections for Mg(2+),
deoxynucleotide triphosphate, and dimethyl sulfoxide concentrations with comparison to alternative
empirical formulas. Clin Chem, 47(11):1956--1961.
[Welling et al., 1985] Welling, G. W., Weijer, W. J., van der Zee, R., and Welling-Wester, S.
(1985). Prediction of sequential antigenic regions in proteins. FEBS Lett, 188(2):215--218.
[Whelan and Goldman, 2001] Whelan, S. and Goldman, N. (2001). A general empirical model of
protein evolution derived from multiple protein families using a maximum-likelihood approach.
Molecular Biology and Evolution, 18:691--699.
[Wootton and Federhen, 1993] Wootton, J. C. and Federhen, S. (1993). Statistics of local
complexity in amino acid sequences and sequence databases. Computers in Chemistry,
17:149--163.
[Workman and Krogh, 1999] Workman, C. and Krogh, A. (1999). No evidence that mRNAs have
lower folding free energies than random sequences with the same dinucleotide distribution.
Nucleic Acids Res, 27(24):4816--4822.
[Xu and Zhang, 2010] Xu, J. and Zhang, Y. (2010). How significant is a protein structure similarity
with TM-score = 0.5? Bioinformatics, 26(7):889--895.
[Yang, 1994a] Yang, Z. (1994a). Estimating the pattern of nucleotide substitution. Journal of
Molecular Evolution, 39(1):105--111.
[Yang, 1994b] Yang, Z. (1994b). Maximum likelihood phylogenetic estimation from DNA
sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution,
39(3):306--314.
[Zhang and Skolnick, 2004] Zhang, Y. and Skolnick, J. (2004). Scoring function for automated
assessment of protein structure template quality. Proteins, 57(4):702--710.
[Zuker, 1989a] Zuker, M. (1989a). On finding all suboptimal foldings of an RNA molecule.
Science, 244(4900):48--52.
[Zuker, 1989b] Zuker, M. (1989b). The use of dynamic programming algorithms in RNA secondary
structure prediction. Mathematical Methods for DNA Sequences, pages 159--184.
[Zuker and Sankoff, 1984] Zuker, M. and Sankoff, D. (1984). RNA secondary structures and
their prediction. Bulletin of Mathematical Biology, 46:591--621.
[Zuker and Stiegler, 1981] Zuker, M. and Stiegler, P. (1981). Optimal computer folding of
large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res,
9(1):133--148.