Remote Admin Training
Agenda
Day 1
• Module 1: Dataiku Overview + Architecture
• Lab: Installing DSS
Day 2
• Module 5: Code Environments
• Lab: Maintaining Code Envs
Pre-requisites
Module 1:
Dataiku Overview
YOUR PATH TO ENTERPRISE AI
WHAT DOES IT MEAN?
[Figure: the project pipeline alternates between Business Analyst and Data Scientist roles: VISUAL AUTO PREP, CODING, VISUAL AUTO ML, VISUAL PIPELINE, MODEL DEPLOYMENT, VISUAL MODEL MONITORING, across ENVIRONMENT(S).]
[Figure: Understand, Progress, Optimize, Integrate, Speed, Monitor. Use for productivity results; use as a baseline and extend; use for optimization; prototype and reuse; enable fast prototyping (incl. data integration); augment or replace manual process thanks to AI; detection of dead-ends.]
Dataiku Architecture
Dataiku DSS Architecture, Ready For Production
[Diagram: three zones. Development Zone: Business Analysts and Data Scientists work on the DESIGN node. Data Production Zone: "Deploy Workflow" to the AUTOMATION node. Web Production Zone: "Deploy Model" to the DEPLOYER and SCORING nodes, which serve End Users.]
Execution engines:
● Run in Memory: Python, R, …
● Run in Database: Enterprise SQL, Analytic SQL
● Run in Cluster: Spark, Impala, Hive, …
● Distributed ML: MLlib, H2O, …
● ML in Memory: Python Scikit-Learn, R, …
Data sources:
● Data Lake: Cassandra, HDFS, S3, …
● Database Data: Vertica, Greenplum, Redshift, PostgreSQL, …
● File System Data: Host File System, Remote File System, …
By default, DSS automatically chooses the most effective execution engine amongst the engines available next to the input data of each computation.
The Dataiku DSS Architecture (simplified)
[Diagram: Projects A, B, etc. on DSS; In Memory Compute; External Compute (Hadoop Cluster); data on Remote FS, Cloud Storage, etc.]
Example of Full Life Cycle of a Project
[Diagram: Project Design (Development, HDFS data) → Project Testing (API PRE-PROD, AWS, HDFS data) → Production (API PRODUCTION, HDFS data)]
Enterprise Scale Sizing Recommendation
● Design node (128-256 GB): Design nodes generally consume more memory than other nodes because they are the collaborative environment for design, prototyping and experiments.
● Automation node (64-128 GB, + 64 GB in preprod): The Automation node runs, maintains and monitors project workflows and models in production. Since the majority of actions are batches, you can spread the activity over the 24 hours and optimize resource consumption. You can also use a non-production automation node to validate your project before going to production.
● Scoring node (4+ GB per node, fleet of n nodes): Scoring nodes are real-time production nodes for scoring or labeling with prediction models. A single node doesn’t require a lot of memory, but these nodes are generally deployed on dedicated clusters of containers or virtual machines.
Memory usage on the DSS server side can be controlled at the Unix level when DSS impersonation is activated.
Database resource management can be done on the DB side at the user level when per-user credentials mode is activated.
DSS Components and Processes
NGINX
BACKEND
Backend is a single point of failure. It won't go down alone! Hence it is supposed to handle as little actual
processing as possible. Backend can spawn child processes: custom scenario steps/triggers, Scala
validation, API node DevServer, macros, etc.
IPYTHON (JUPYTER)
FUTURE EXECUTION KERNEL (FEK)
ANALYSIS SERVER
WEBAPP BACKEND
Open Ports
Base Installations
● Design: User’s choice of base TCP port (default 11200) + next 9 consecutive ports
Only the first of these ports needs to be opened out of the machine. It is highly recommended to firewall the other ports
● Automation: User’s choice of base TCP port (default 12200) + next 9 consecutive ports
Only the first of these ports needs to be opened out of the machine. It is highly recommended to firewall the other ports
● API: User’s choice of base TCP port (default 13200) + next 9 consecutive ports
Only the first of these ports needs to be opened out of the machine. It is highly recommended to firewall the other ports
Supporting Installations
● Data sources: JDBC entry point; network connectivity
● Hadoop: ports + workers required by specific distribution; network connectivity
● Spark: executor + callback (two way connection) to DSS
Privileged Ports
● DSS itself cannot run on ports 80 or 443 because it does not run as root, and cannot bind to these privileged ports.
● The recommended setup to have DSS available on ports 80 or 443 is to have a reverse proxy (nginx or apache) running on the same
machine, forwarding traffic from ports 80 / 443 to the DSS port. (https://doc.dataiku.com/dss/latest/installation/proxies.html)
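The port rule above (a base port plus the next 9 consecutive ports, with only the first one exposed) can be sketched in a few lines of Python. This is only an illustration under the defaults quoted on this slide; the function names are hypothetical and a real installation may use other base ports.

```python
PORTS_PER_NODE = 10  # base port plus the next 9 consecutive ports

# Default base ports quoted above; a real install may choose others.
DEFAULT_BASE_PORTS = {"design": 11200, "automation": 12200, "api": 13200}


def port_range(base_port):
    """Return the ports a node occupies: base_port .. base_port + 9."""
    return range(base_port, base_port + PORTS_PER_NODE)


def firewall_plan(base_ports):
    """Split each node's ports into the one to expose and those to firewall."""
    plan = {}
    for node, base in base_ports.items():
        ports = list(port_range(base))
        plan[node] = {"open": ports[0], "firewalled": ports[1:]}
    return plan


def overlapping_nodes(base_ports):
    """Return pairs of nodes whose 10-port ranges collide on one machine."""
    nodes = sorted(base_ports)
    clashes = []
    for i, first in enumerate(nodes):
        for second in nodes[i + 1:]:
            shared = set(port_range(base_ports[first])) & set(port_range(base_ports[second]))
            if shared:
                clashes.append((first, second))
    return clashes


if __name__ == "__main__":
    for node, ports in firewall_plan(DEFAULT_BASE_PORTS).items():
        print(node, "open:", ports["open"], "firewall:", ports["firewalled"])
    print("clashes:", overlapping_nodes(DEFAULT_BASE_PORTS))
```

The overlap check is useful when several node types share one machine, since two base ports closer than 10 apart would collide.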
Installing DSS
Command Line Installation
(the easy part)
The Data Science Studio Installation process is fairly straightforward. Due to the number of options available, we
do have several commands to issue for a full installation. There are a couple of important terms to understand
before we start.
● DSSUSER -- This is a Linux User ID that will run DSS. It does not require elevated privileges.
● DATADIR -- This is the directory where DSS will install binaries, libraries, configurations and store all data.
● INSTALLDIR -- This is the directory created when you extract the DSS tar file.
● DSSPORT -- This is the first port that DSS Web Server opens to present the Web UI. We request 9 additional
ports, in sequence, for interprocess communications.
● Hadoop Proxy User -- If you are connecting to a Hadoop cluster with Multi-User Security, the Proxy User
configuration must be enabled. Additional details are contained in our reference documentation.
● Kerberos Keytab -- If your Hadoop cluster uses Kerberos, we will need a keytab file for the DSSUSER.
Key integration points
Example Install Commands
As root:
/home/dataiku/dssdata/bin/dssadmin install-impersonation DSS_USER
Upgrading
Upgrade Options:
1. In place (recommended)
a. ./install_dir/installer.sh -d <path to data_dir> -u
2. Project Export/Import
a. tedious
3. Cloning
a. be careful of installing on the same machine (port conflicts, overwriting directories, etc)
Time for the Lab!
Module 2:
DSS Integrations
SQL Integrations
DSS and SQL - Supported flavors
DSS and SQL - Installing the Database Driver
DSS and SQL - Defining a connection through the UI.
We already have a PostgreSQL connection on DSS, but these would be the steps to follow to create your
connection
Fetch size
DSS and SQL - Connection parameters.
Relocation of SQL datasets
● Schema: ${projectKey}
● Table name prefix: ${myvar1}_
● Table name suffix: _dss
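To make the relocation settings above concrete, here is a small Python sketch of how a schema and table name could be derived from them. It only mimics the `${...}` expansion with `string.Template`; the project key, the value of `myvar1` and the dataset name are hypothetical, and DSS's own expansion rules may differ in detail.

```python
from string import Template


def resolve_table_location(dataset_name, variables,
                           schema="${projectKey}",
                           prefix="${myvar1}_",
                           suffix="_dss"):
    """Expand ${...} variables and return (schema, table_name)."""
    resolved_schema = Template(schema).substitute(variables)
    resolved_prefix = Template(prefix).substitute(variables)
    resolved_suffix = Template(suffix).substitute(variables)
    table_name = f"{resolved_prefix}{dataset_name}{resolved_suffix}"
    return resolved_schema, table_name


if __name__ == "__main__":
    # Hypothetical project key, variable value and dataset name.
    variables = {"projectKey": "CHURN", "myvar1": "dev"}
    print(resolve_table_location("customers", variables))
```

Because the schema and prefix come from variables, the same project can write to different tables on another instance simply by changing the variable values.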
Hadoop Integrations
DSS and Hadoop - Supported flavors
Installing HDFS Integration
● DSS node should be set up as edge node to cluster.
○ I.e., common client tools should function, such as “hdfs dfs”, “hive” / “beeline”, “spark-shell” / “pyspark” / “spark-submit”
○ ./bin/dssadmin install-hadoop-integration
HDFS connection
Metastore DSS
• Welcome to the danger zone : clean way to rename a dataset without removing the data ?
Managed (Internal) vs External Hive tables
• Managed tables
• Managed by hive user
• Location : /user/hive/warehouse
• DROP TABLE: also removes the data
• Security managed with the Hive schema and a service like Sentry/Ranger, etc.: GRANT ROLE …
• External tables
• CREATE EXTERNAL TABLE ( … ) LOCATION '/path/to/folder'
• DROP TABLE: removes the Hive table but not the data
• Security: filesystem permissions of the folder
Exposing HDFS data to end-users
• Depends on your data ingestion process: raw data is put on HDFS
• From files
○ Create an HDFS dataset and specify the HDFS path of your files
○ dss_service_account needs to have access to those files
Exposing Hive data to end-users
• Depends on your data ingestion process: Hive tables/views exist
Hive config
• HiveServer2 :
• Recommended mode (others may be deprecated in the future)
• Mandatory for MUS
• Mandatory for notebooks and metrics
• Target the global metastore
• When MUS is not activated, you can access every Hive table created by DSS, even if
your user doesn’t have access to the related HDFS connections
Multiple Clusters in DSS
• DSS can create compute clusters in
ADMIN > CLUSTERS
Warnings:
● For DSS to work with a cluster, it needs to have the necessary binaries and
client configurations available.
● DSS can only work with one set of binaries, meaning that a single DSS
instance can only work with one Hadoop vendor/version.
○ DSS “cluster” definitions override global cluster client configs.
● For secure clusters, DSS is only configured to use one keytab, so all clusters
must accept that keytab (same realm or cross-realm trust)
Spark Integrations
Spark Supported Flavors + Usage
Installing Spark
○ Note that DSS can only work with one Spark version.
Spark Configuration
● Global Settings:
○ Admins can create spark configurations in
ADMIN > SETTINGS > SPARK. These define
spark settings that users can leverage.
● Project/Local Settings:
○ Project admin may also set spark conf at
the project level. SETTINGS > ENGINES &
CONNECTIONS
Notes on Spark Usage
● It is highly advisable to have spark read from an HDFS connection (even if it’s
on cloud storage, set up a HDFS connection w/ the proper scheme).
○ Spark is able to properly read dataset from HDFS connection and parallelize it accordingly.
○ Spark is also able to read optimized formats with the HDFS connector (parquet/ORC/etc),
whereas more native connectors don’t understand these formats
○ For non-HDFS/non-S3 datasets, Spark will read the dataset in a single thread and create 1
partition. This is likely to be suboptimal, so users will need to repartition the dataset before
any serious work on large datasets.
○ For HDFS datasets, groups using the connection should be able to read details of the dataset.
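The repartitioning advice above can be sketched as a small helper. It relies on two standard PySpark DataFrame calls, `df.rdd.getNumPartitions()` and `df.repartition(n)`; the helper name and the `min_partitions` threshold are hypothetical, and the stand-in classes exist only so the sketch can be dry-run without a Spark session.

```python
def ensure_min_partitions(df, min_partitions):
    """Repartition a Spark DataFrame that has too few partitions."""
    current = df.rdd.getNumPartitions()
    if current >= min_partitions:
        return df  # already parallel enough, avoid a needless shuffle
    return df.repartition(min_partitions)


class _FakeRDD:
    """Stand-in for the .rdd attribute of a PySpark DataFrame."""

    def __init__(self, num_partitions):
        self._num_partitions = num_partitions

    def getNumPartitions(self):
        return self._num_partitions


class FakeDataFrame:
    """Tiny stand-in exposing only the two PySpark calls used above."""

    def __init__(self, num_partitions):
        self.rdd = _FakeRDD(num_partitions)

    def repartition(self, num_partitions):
        return FakeDataFrame(num_partitions)


if __name__ == "__main__":
    single = FakeDataFrame(1)  # e.g. a dataset read in a single thread
    print(ensure_min_partitions(single, 8).rdd.getNumPartitions())
```

In a Spark recipe the same function would be called on the real DataFrame right after reading the dataset; a suitable partition count depends on data size and cluster resources.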
Spark Multi-cluster
● EMR and Dataproc (experimental) plugins are also options, outside of normal
hadoop distributions (CDH/HDP).
Time for the Lab!
Module 3:
Security
DSS Security
User Identity
● Users come from one of two locations:
○ local db
○ LDAP/AD
User Authentication
● Users are authenticated via:
○ local password
○ LDAP
○ SSO
Users can be one of three types:
● Local (local acct/local pass)
● LDAP (ldap acct/ldap pass or SSO)
● Local No Auth (local acct/sso)
LDAP(S) Integration
4 main pieces of information to provide:
● LDAP Connection: obtain from LDAP admin
● User Mapping: Filter corresponding to users
in DSS.
○ specify which attributes are display name and
email
○ toggle whether users are automatically imported
or not
● Group Mapping: Filter defining to which
groups a user belongs
○ specify attribute for group name
○ optionally white list groups that can connect to
DSS
● Profile Mapping: Define what profile a
group is assigned to
SSO Integration
● Users can be from local DB or LDAP
● Supports SAMLv2 (recommended) and SPNEGO
● For SAML need:
○ IdP Metadata (provided by SSO admin)
■ Will likely need a callback URL: https://dss.mycompany.corp/dip/api/saml-callback
○ SP Metadata (generate)
■ If there’s no internal process, you can do this online. Will need at least the entityID (from IdP Metadata) and the Assertion Consumer Service endpoint (callback URL). X.509 certs are also commonly needed; get them from the IdP Metadata.
○ Login Attribute
■ Attribute in the assertion sent by the IdP that contains the DSS login.
○ Login Remapping Rules
■ Rules to map login attribute to user login.
■ E.g. first.last@mydomain.com → first.last via ([^@]*)@mydomain.com -> $1
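The remapping rule above is a regular expression with a capture group. The Python sketch below shows what such a rule does to a login attribute; note that Python writes the back-reference as `\1` rather than `$1`, that the dot is escaped here so it matches only a literal ".", and that requiring the whole attribute to match is an assumption of this sketch rather than documented DSS behaviour.

```python
import re

# Rule from the slide: ([^@]*)@mydomain.com -> $1
RULE_PATTERN = re.compile(r"([^@]*)@mydomain\.com")
RULE_REPLACEMENT = r"\1"  # Python spelling of the $1 back-reference


def remap_login(login_attribute):
    """Apply the rule when the whole attribute matches; otherwise keep it."""
    match = RULE_PATTERN.fullmatch(login_attribute)
    if match is None:
        return login_attribute
    return match.expand(RULE_REPLACEMENT)


if __name__ == "__main__":
    print(remap_login("first.last@mydomain.com"))
    print(remap_login("someone@other.org"))
```

Testing a rule this way against a few real login attributes before saving it in DSS helps avoid locking users out with a pattern that never matches.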
Permission Model
Multi-faceted tools to control security in the system:
● Users:
○ Must exist to log in to DSS
○ Belong to a GROUP
○ Have a PROFILE
● User Profile:
○ Mainly a licensing mechanism
○ Designer: R/W access
■ aka Data Scientist/Data Analyst
○ Explorer: R access only
■ aka Reader
● Group:
○ Collection of users
○ Defines Global Permissions (i.e. are you an admin? Can you create connections? etc)
● Projects:
○ Determines privilege of each GROUP
○ Can enforce project-level settings (lock code env, etc)
● Data Connections:
○ Grant access to GROUPS
○ Some connections allow per-user credentials
Permission Model
Users
● Users get assigned profile + group.
○ Can determine this automatically via
mapping rules, as discussed previously
Note that in new licenses, the Data Analyst does not exist anymore:
Permission Model
Global Group Permissions
- On each project, you can configure an arbitrary number of groups who have access to this project. Adding
permissions to projects is done in the Project Settings, in the Security section.
- Each group can have one or several of the following permissions. By default, groups don’t have any kind of access to
a project.
- Being the owner of a project does not grant any additional permissions compared to being in a group who has
Administrator access to this project.
- This owner status is used mainly to actually grant access to a project to the user who just created it.
Permission Model
Additional Project Security
● Dashboard Authorizations
○ Which objects can be accessed by dashboard-only users
● Dashboard Users
○ Add external users who are able to access Dashboards
Permission Model
Additional Project Settings
● Cluster Selection
○ Select default Cluster to use
● Container Exec
○ Specify default container env
Permission Model
Data Connections
GDPR
In order to help our customers better comply with GDPR (General Data Protection
Regulation), DSS provides a GDPR plugin which enables additional security
features.
● Configure GDPR admins and
documentation groups
● Document datasets as having personal data
● Project level settings to control specific functionality:
○ Forbid Dataset Sharing
○ Forbid Dataset/Project Export
○ Forbid Model Creation
○ Forbid uploaded Datasets
○ Blacklist Connections
● Easily filter to find sensitive datasets
Time for the Lab!
Module 4
DSS Automation Node Overview
Production in DSS - O16n
Deploying a Data Science project to production
[Diagram: Sandbox project → Operationalization (o16n) → Project in production environment]
Deployment to production - Motivation
Why do we need a separate environment for our Project ?
We want to have a safe environment where our prediction project is not at risk of being altered by
modifications in the flow. We also want to be connected to our production databases.
We want to be able to have health checks on our data, monitor failures in building our flow and be
able to roll-back to previous versions of the flow if needed.
AUTOMATION Node
Installing/Configuring an Automation Node
Once the design node is set up, the automation node is straightforward to set up.
● Install Automation Node via:
dataiku-dss-VERSION/installer.sh -t automation -d DATA_DIR -p PORT -l LICENSE_FILE
○ DATA_DIR and PORT are unique to the automation node, i.e. do NOT use the same
ones used for the Design Node.
● Once installed, configure it exactly as we did the design node, i.e.:
○ R integration
○ Hadoop Integration
○ Spark Integration
○ Set up dataset connections
○ Users/Group setup
○ Multi-user Security, etc.
DSS Design to Automation Workflow
From Design Node to Automation Node
Moving a project from the Design Node to the Automation Node takes a few straightforward
steps:
1) “Bundle” your project in the Design Node : this will create a zip file with all your
project’s config
2) Download the bundle to your local machine.
3) Upload the bundle to the Automation Node to create a new project or update an
existing one.
This step may require dataset connection remapping.
Note that all of these steps can be automated using our Public API, either from within a DSS instance (a
Macro) or from another application.
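The three steps above can be sketched against the Python client for the Public API. The method names used here (`export_bundle`, `download_exported_bundle_archive_to_file`, `import_bundle_from_archive`, `activate_bundle`) are written as recalled from the `dataikuapi` package and should be checked against the API documentation for your DSS version; the function name, project key, bundle id and path are hypothetical. The sketch assumes the project already exists on the Automation node and does not handle connection remapping. A recording stand-in is included so the flow can be dry-run without a DSS instance.

```python
def promote_bundle(design_project, automation_project, bundle_id, archive_path):
    """Export a bundle on Design, ship it to Automation and activate it."""
    # 1) "Bundle" the project on the Design node.
    design_project.export_bundle(bundle_id)
    # 2) Download the bundle archive (a zip file) to local disk.
    design_project.download_exported_bundle_archive_to_file(bundle_id, archive_path)
    # 3) Upload the archive to the existing Automation project, then activate it.
    automation_project.import_bundle_from_archive(archive_path)
    automation_project.activate_bundle(bundle_id)
    return bundle_id


class DryRunProject:
    """Stand-in that records calls, so the flow can be checked without DSS."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    def __getattr__(self, method_name):
        # Only reached for attributes that are not set in __init__.
        def record(*args):
            self.calls.append((method_name, args))
        return record


if __name__ == "__main__":
    design = DryRunProject("design")
    automation = DryRunProject("automation")
    promote_bundle(design, automation, "v1", "/tmp/MYPROJECT-v1.zip")
    print(design.calls)
    print(automation.calls)
```

With the real client, `design_project` and `automation_project` would be project handles obtained from two client objects, one per node, each created with that node's URL and an API key.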
From Design Node to Automation Node
[Diagram: Design Data Sources (Remote FS, Cloud Storage, etc.) → Design Node → Download Bundle → Upload Bundle → Automation Node → Production Data Sources (Remote FS, Cloud Storage, etc.)]
Design Node:
- Design projects
Automation Node:
- Monitor projects in production.
- Version control.
- Consume Deliverables/Consumption Analytics (Dashboards)
Creating a Bundle
On the Design node, go to Bundles > Create your first Bundle
By default, only the project metadata are bundled. As a result, all datasets will come empty and models will
come untrained by default.
A good practice is to have the Automation Node connected to separate Production data sources. Dataset
connections can be remapped after uploading the bundle.
The Design node tracks all bundles. You can think of these as versions of your project.
Upload the bundle to the Automation Node Hands-on
Click Import Your First Project Bundle, choose the bundle file on your computer
and click Import
When importing the project, you may be prompted to remap connections and/or
Code Envs
Activate the bundle Hands-on
Finally, activate your Scenarios
After activating your first bundle, you need to go to the Scenario tab and activate the three
scenarios. You can trigger them manually to make sure everything is OK.
You won't need to activate them again when updating the bundle, as we will see in the next
slide.
Project versioning
Rolling back to a previous version
From the bundle list, you can always select an older version and click “Activate” to roll back to that
version.
Or… use the macro
DSS has a macro for automating pushing a bundle from a design node to an automation node.
For complicated workflows, you can also work directly w/ the DSS APIs and implement whatever logic
is needed.
DSS API Deployer/Node Overview
What is an API ?
An API is a software intermediary that allows two applications to talk to each other and
exchange data over the HTTP protocol.
Example: getting weather data from a Google API
An endpoint is a single path on the API and is contained within an API Service. Each endpoint fulfils a single
function.
The DSS API Node
We can design different kinds of REST Web Services in the Design Node. Those web services can receive ad-hoc
requests from clients and return "real time" responses over the HTTP protocol. Those REST services can be hosted on
a separate DSS instance: the API Node
[Diagram: a Client Application sends an HTTP(S) GET/POST REST request with a JSON payload such as { 'feature1': 1, 'feature2': 2 } to the API Node. The API Node applies an optional data transformation (prep script), using a managed SQL db XOR a referenced SQL db, which yields { 'feature1': 1, 'feature2': 2, 'feature3': 3 }, then runs Model Prediction (Java scoring).]
[Diagram: the Model API Deployer manages Infrastructures (DEVELOPMENT, PRODUCTION) of API Nodes, each hosting service versions such as Service A v1, Service A v2 (pred_endpoint) and Service B v1 (pred_endpoint_2).]
API Services - The Model API Deployer
The model API Deployer is a visual interface to centralize the management of your APIs deployed on one or several
Dataiku API Nodes.
It can be installed locally (on the same node as the Design or Automation node; no setup required) or as a standalone node
(requires install).
If using a local API Deployer, it can be accessed from the menu.
Installing/Configuring an API Deployer Node
● Design/Automation nodes have an API Deployer built in. The local API Deployer can be used, or a
separate deployer can be set up. A separate deployer is typically recommended when many
Design/Automation nodes will be flowing into the same deployer, or when there are many API nodes or
deployments to manage.
● Install API Deployer Node via:
dataiku-dss-VERSION/installer.sh -t apideployer -d DATA_DIR -p PORT -l LICENSE_FILE
○ DATA_DIR and PORT are unique to the apideployer node, i.e. do NOT use the same ones used for
the Design Node.
● Generate a new API key on the API Deployer (ADMIN > Security > GLOBAL API KEYS). Must have admin
access.
● On every Design/Automation node that will connect to the deployer:
○ Go to Administration > Settings > API Designer & Deployer
○ Set the API Deployer mode to “Remote” to indicate that we’ll connect to another node
○ Enter the base URL of the API Deployer node that you installed
○ Enter the secret of the API key
● The API deployer doesn’t directly access data so we don’t need to set up all the integration steps we did
on the design/automation node.
Installing/Configuring an API Node
● Install API Node via:
dataiku-dss-VERSION/installer.sh -t api -d DATA_DIR -p PORT -l LICENSE_FILE
○ DATA_DIR and PORT are unique to the api node, i.e. do NOT use the same ones used for the Design
Node.
● The API Node doesn’t directly access data so we don’t need to set up all the integration steps we did on
the design/automation node.
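The installer invocations for the automation, apideployer and api nodes share one shape: `installer.sh -t TYPE -d DATA_DIR -p PORT -l LICENSE_FILE`. As a hedged sketch, the Python helper below assembles those command lines and checks the "unique DATA_DIR and PORT" rule before anything is run; the function names, paths and version string are hypothetical.

```python
import shlex

NODE_TYPES = ("automation", "apideployer", "api")  # values of -t shown above


def installer_command(version, node_type, data_dir, port, license_file):
    """Build the installer.sh argument list for one node."""
    if node_type not in NODE_TYPES:
        raise ValueError(f"unexpected node type: {node_type}")
    return [
        f"dataiku-dss-{version}/installer.sh",
        "-t", node_type,
        "-d", data_dir,
        "-p", str(port),
        "-l", license_file,
    ]


def check_unique(nodes):
    """Raise ValueError if two nodes share a DATA_DIR or a base PORT."""
    seen_dirs, seen_ports = set(), set()
    for node in nodes:
        if node["data_dir"] in seen_dirs:
            raise ValueError(f"DATA_DIR reused: {node['data_dir']}")
        if node["port"] in seen_ports:
            raise ValueError(f"PORT reused: {node['port']}")
        seen_dirs.add(node["data_dir"])
        seen_ports.add(node["port"])


if __name__ == "__main__":
    # Hypothetical layout for three nodes on one machine.
    nodes = [
        {"type": "automation", "data_dir": "/data/automation", "port": 12200},
        {"type": "apideployer", "data_dir": "/data/deployer", "port": 14200},
        {"type": "api", "data_dir": "/data/api", "port": 13200},
    ]
    check_unique(nodes)
    for node in nodes:
        cmd = installer_command("VERSION", node["type"], node["data_dir"],
                                node["port"], "/tmp/license.json")
        print(shlex.join(cmd))
```

The sketch only prints the commands; running them is left to the administrator so the output can be reviewed first.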
Setting up Static Infrastructure on API Deployer
Using K8s for API Node Infra
API Deployer Node must be set up to work with K8s. Requirements are the same as having Design/Automation
node work with K8s. Details will be covered in a later section. Once Configured:
● Kubectl context: if your kubectl configuration file has several contexts, you need to indicate which one DSS
will target. This allows you to target multiple Kubernetes clusters from a single API Deployer by using
several kubectl contexts
● Kubernetes namespace: all elements created by DSS in your Kubernetes cluster will be created in that
namespace
● Registry host: registry where images are stored.
DSS API Deployer Workflow
Deploying our prediction model
The workflow for deploying the prediction model from your Automation node to an API
Node is as follows:
1) Create a new API Service and an API endpoint from your flow model
2) (Optional) Add a data enrichment to the model endpoint
3) Test the endpoint and push a new version to the API deployer
4) (Optional) Deploy our version to our Dev infrastructure
5) Test our version and push it to Production infrastructure
6) (As needed) Deploy a new version of the service with an updated model
7) (As needed) A/B test our 2 service versions inside a single endpoint
8) Integrate it in our real time prediction App.
Creating an Endpoint in a new Service
Push to API Deployer
Deploying your API service version to an infrastructure
Switching our deployment from dev to prod
Steps:
- In your dev Deployment, go to
Actions > Copy this deployment
- Select the copy target as the
PRODUCTION stage infrastructure
- Click on “Start now”
- Once the prod deployment is
done, check the Deployments
screen
Switching our deployment from dev to prod
We now have two deployments running, one on our Dev infrastructure and the other in Production
We have a real time prediction API !!
Go to Deployment > Summary > Endpoint URL
This URL is the path to our API endpoint → this is what we will use in our third party apps to get model
predictions
You will get a different URL for each API node in your infrastructure. You can set up a load balancer to
round-robin the different endpoints.
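Round-robin dispatch across the per-node endpoint URLs can be illustrated in a few lines of Python. This is only a client-side sketch of what a load balancer would do; the URLs and queries below are hypothetical placeholders, not real DSS paths.

```python
import itertools


def assign_round_robin(queries, endpoint_urls):
    """Pair each query with the next endpoint URL, cycling through the list."""
    if not endpoint_urls:
        raise ValueError("at least one endpoint URL is required")
    targets = itertools.cycle(endpoint_urls)
    return [(next(targets), query) for query in queries]


if __name__ == "__main__":
    # Hypothetical endpoint URLs, one per API node in the infrastructure.
    urls = ["http://apinode-1.example.com/endpoint",
            "http://apinode-2.example.com/endpoint"]
    queries = [{"feature1": 1, "feature2": 2},
               {"feature1": 3, "feature2": 4},
               {"feature1": 5, "feature2": 6}]
    for url, query in assign_round_robin(queries, urls):
        print(url, query)
```

A real load balancer would also take node health into account, which this sketch deliberately leaves out.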
Calling our real time prediction API from the outside
Deploying a new version of the service
You can deploy a new version of your service at any time in the API Designer.
Click on your service and push a new version (‘v2’, etc) to the API Deployer.
Deploying a new version of the service
Go to your API Deployer, deploy the new version of your deployment to your dev infrastructure, select
“Update an existing deployment”
A/B testing service versions
In order to A/B test our 2 service versions, we will have to randomly dispatch the queries between
version 1 and version 2 :
1. Click on your Deployment > Settings
2. Set Active version mode to “Multiple Generations”
3. Set Strategy to “Random”
4. Set Entries to :
[
{"generation": "v2",
"proba": 0.6},
{"generation": "v1",
"proba": 0.4}
]
5. Save and update deployment
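To see what the "Random" strategy with these entries implies, the sketch below simulates the dispatch in Python. It is not DSS's implementation, only a weighted random choice over the same entries; the function names and the fixed seed are choices made for this illustration.

```python
import random
from collections import Counter

# Same entries as in the deployment settings above.
ENTRIES = [
    {"generation": "v2", "proba": 0.6},
    {"generation": "v1", "proba": 0.4},
]


def pick_generation(entries, rng=random):
    """Pick one generation at random, weighted by its "proba"."""
    generations = [entry["generation"] for entry in entries]
    weights = [entry["proba"] for entry in entries]
    return rng.choices(generations, weights=weights, k=1)[0]


def simulate(entries, n_queries, seed=0):
    """Count how many of n_queries land on each generation."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    return Counter(pick_generation(entries, rng) for _ in range(n_queries))


if __name__ == "__main__":
    counts = simulate(ENTRIES, 10_000)
    for generation, count in sorted(counts.items()):
        print(generation, count / 10_000)
```

Over many queries the observed shares approach 60% for v2 and 40% for v1, while any single query can land on either version.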
A/B testing service versions
Go back to the predictions webapp, run the same query several times and see how it is
dispatched between version 1 and 2!
DSS Automating the API Deployer Workflow
Create new API Service Version in Scenario
Go to your scenario’s steps
Add a step Create API Service Version → This will create a new API service version with the model
specified
Update API deployment in Scenario
Adding a step to Update our deployment in the API Deployer
Save and run the scenario, then go to the API Deployer and check that your new version
is deployed on the dev infrastructure
Time for the Lab!
Module 5
Code Environments
DSS Code Environments
Code Environments in DSS
Customize your environment: code env !
→ In the case of Python environments, each environment may also use its
own version of Python
➢ DSS allows Data Scientists to create and manage their own Python and R coding
environments, if given permission to do so by an Admin (Group Permissions)
➢ These Envs can be activated and deactivated for different pieces of code/levels in
DSS including
○ Projects, web apps, notebooks, and plugins
➢ To create/ manage Code Envs: Click the Gear -> Administration -> Code Envs
Code Environments in DSS
Creation
Code Environments in DSS
Installing Packages to your Env
➢ Permissions
○ Allow groups to use the code env
and define their level of use: i.e.
use only, can manage/update
➢ Container Exec
○ Build docker images that include
the libraries of your code env
○ Build for specific container
configs or all configs
➢ Logs
○ Review any errors from installing the
code env
Code Environments in DSS
Activating Code Envs
Using Non-standard Repositories
RStudio Integration - Overview
● DSS comes with Jupyter pre-installed for Notebooks use. This enables use of coding in:
○ Python
○ R
○ Scala
● Some Data Scientists prefer using different editors. Options are available for non-Jupyter use:
○ Embedded in DSS:
■ RStudio Server on DSS Host
■ RStudio Server External to DSS Host
○ Other External Coding:
■ RStudio Desktop
■ PyCharm
■ Sublime
● Note, execution is always done via DSS. External coding allows connecting to DSS via API to edit code and push
back into DSS.
RStudio Integration - Desktop
○ In Env Variables:
○ In ~/.dataiku/config.json
● Addins menu now has options for
interacting with dataiku
● Docs have a user tutorial for
working with these commands
RStudio Integration - External Server
● RStudio on an External Host can be set up exactly like RStudio Desktop to remotely work with DSS
● Additionally, you can embed RStudio Server in the DSS UI:
○ Edit /etc/rstudio/rserver.conf and add a line www-frame-origin = BASE_URL_OF_DSS
○ Restart RStudio Server
○ Edit DSS_DATA_DIR/config/dip.properties and add a line
dku.rstudioServerEmbedURL=http(s)://URL_OF_RSTUDIO_SERVER/
○ Restart DSS
● RStudio can now be accessed via the UI.
● Log in to RStudio Server as usual
● Interact w/ DSS as described with Desktop Integration.
RStudio Integration - Shared Server
● If
○ RStudio Server is on the same host as DSS
○ MUS is enabled
○ the same unix account is used for DSS and RStudio, then
● An enhanced integration is available:
○ DSS will automatically install the dataiku package in the user’s R library
○ DSS will automatically connect DSS to RStudio, so that you don’t have to declare the URL and API token
○ DSS can create RStudio projects corresponding to the DSS project
● Embed RStudio as described for the external host. RStudio has an “RStudio Actions” page where you can:
○ Install R Package
○ Setup Connection
○ Create Project Folder
Time for the Lab!
Module 6
DSS Maintenance
DSS Logs
Main DSS Process Log Files
Those logs are located in the DATA_DIR/run directory and are also accessible through the UI
(Administration > Maintenance > Log files)
By default, the “main” log files are rotated when they reach a given size, and purged after a given number of
rotations. By default, rotation happens every 50 MB and 10 files are kept.
Those default values can be changed in the DATA_DIR/install.ini file (the installation configuration file)
Job Logs
Every time you run a recipe, a log file is generated. Go to the Jobs page of the project: click on the triangle ("play") sign
or type the “gj” keyboard shortcut
The last 100 job log files can be seen through the UI (see picture above). All the job logs files are stored in the
DATA_DIR/jobs/PROJECT_KEY/ directory.
Job Logs
When you click on a job log, you can view the full log or download a job diagnosis.
When interacting with Dataiku support about a job, it is good practice to send us a job diagnosis.
The DATA_DIR/jobs/PROJECT_KEY log files are not automatically purged, so the directory can quickly become big.
You need to clean old job log files once in a while. A good way to do this is through the use of Macros, which we will
discuss later.
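Before cleaning anything, it helps to see which job directories are old. The Python sketch below is a read-only dry run that lists candidates; it assumes one sub-directory per job under DATA_DIR/jobs/PROJECT_KEY/, which is an assumption of this example, and it deletes nothing (the actual cleanup is best left to the DSS macros). The function name and the 30-day threshold are hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def old_job_dirs(data_dir, max_age_days, now=None):
    """List job directories under DATA_DIR/jobs older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    jobs_root = Path(data_dir) / "jobs"
    stale = []
    for project_dir in sorted(jobs_root.iterdir()):
        if not project_dir.is_dir():
            continue
        for job_dir in sorted(project_dir.iterdir()):
            # Compare the last modification time against the cutoff.
            if job_dir.is_dir() and job_dir.stat().st_mtime < cutoff:
                stale.append(job_dir)
    return stale


if __name__ == "__main__":
    # Demo on a throwaway directory tree instead of a real DATA_DIR.
    with tempfile.TemporaryDirectory() as demo_dir:
        old = Path(demo_dir, "jobs", "MYPROJECT", "job_old")
        new = Path(demo_dir, "jobs", "MYPROJECT", "job_new")
        old.mkdir(parents=True)
        new.mkdir()
        forty_days_ago = time.time() - 40 * SECONDS_PER_DAY
        os.utime(old, (forty_days_ago, forty_days_ago))
        for path in old_job_dirs(demo_dir, max_age_days=30):
            print("candidate for cleanup:", path.name)
```

The resulting list can be used to estimate how much space a cleanup macro would reclaim.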
Scenario Logs
- Each time a scenario is run in DSS, DSS makes a snapshot of the project configuration/flow/code, runs the
scenario (which, in turn, generally runs one or several jobs), and keeps various logs and diagnostic
information for this scenario run.
- The log files are located in the scenario section, in the Last runs tab.
data_dir/PROJECT_NAME/VISUAL_ANALYSIS_ID/MODEL_GROUP_ID/sessions/SESSION_ID/MODEL_ID/train.log
- These logs are not rotated, along w/ the other data in Visual Analysis.
- You can manually remove files or delete analysis data via a macro.
Audit Trail Logs
- DSS includes an audit trail that logs all actions performed by the users, with details about user id,
timestamp, IP address, authentication method, …
- You can view the latest audit events directly in the DSS UI: Administration > Security > Audit trail.
- Note that this live view only includes the last 1000 events logged by DSS, and it is reset each time
DSS is restarted. You should use log files (in DATA_DIR/run/audit) or external systems for real
auditing purposes.
Modifying Log Levels
● Log levels can be modified by changing parameters in:
○ install_dir/resources/logging/dku-log4j.properties
● Configure by logger + by process.
○ The logger is typically the 4th component you see in a log line, e.g.:
○ [2017/02/13-09:01:01.421] [DefaultQuartzScheduler_Worker-1] [INFO] [dku.projects.stats.git] - [ct: 365] Analyzing 17 commits
○ Processes are what we discussed in DSS architecture: jek, fek, etc. dku applies to all processes.
○ You can split processes out to their own log file as well, e.g.
○ install_dir/resources/logging/dku-jek-log4j.properties
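To find which logger name to configure, it can help to pull the 4th bracketed component out of a log line programmatically. The Python sketch below parses the sample line above; the regular expression is inferred from that single example, so treat the exact layout as an assumption rather than a documented format.

```python
import re

# Layout inferred from the sample line: four bracketed fields, " - ", message.
LOG_LINE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<thread>[^\]]+)\] "
    r"\[(?P<level>[^\]]+)\] "
    r"\[(?P<logger>[^\]]+)\]"
    r"\s+-\s+(?P<message>.*)"
)

SAMPLE = ("[2017/02/13-09:01:01.421] [DefaultQuartzScheduler_Worker-1] [INFO] "
          "[dku.projects.stats.git] - [ct: 365] Analyzing 17 commits")


def parse_log_line(line):
    """Return the fields of a log line as a dict, or None if it doesn't match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    fields = parse_log_line(SAMPLE)
    print(fields["logger"], fields["level"])
```

Counting logger names this way over a noisy log file shows which loggers are worth turning down.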
DSS Diagnostic Tool
You may have noticed the Diagnostic tool in the maintenance tab. When interacting with the DSS support
about an issue that is not related to a specific job, they may request this information.
This creates a single file that gives DSS support a good understanding of the configuration of your system, for
aiding in resolving issues.
Troubleshooting Backend Issues
UI Down
UI accessible
Troubleshooting UI Issues
Backend.log
Troubleshooting Notebook Issues
Troubleshooting Hadoop/Spark Issues
Working with DSS support
Forward to support:
Working with DSS support
For customers only, open a ticket on our support portal:
https://support.dataiku.com/ Or send an email to support@dataiku.com
Another channel for support is the Intercom chat that you can reach anywhere on dataiku.com
At times, logs or diagnosis might be too big to be attached to your request. You may want to use
dl.dataiku.com to transfer files
Try to internally manage your questions to the Dataiku support to avoid duplicates and to make sure
everybody on your team benefits from the answers.
Working with DSS support - Intercom
Intercom is the place to visit for usage questions. See example below. (Also, check the documentation :D )
Refrain from using any support channels for code review or administration tasks over which we have no
control.
DSS Data Directory, Disk Space, + BDR/HA
Dataiku Data Directory - DATA_DIR
The data directory is the unique location on the DSS server where DSS stores all its
configuration and data files.
Notably, you will find here:
- Startup scripts to start and stop DSS.
- Settings and definitions of your datasets, recipes, projects, …
- The actual data of your machine learning models.
- Some of the data for your datasets (those stored in DSS-managed local connections).
- Logs.
- Temporary files.
- Caches.
The data directory is the directory that you set when installing DSS on your server
(the -d option).
It is highly recommended that you reserve at least 100 GB of space for the data directory.
158
Dataiku Data Directory - DATA_DIR
DATA_DIR
├── R.lib                      R libraries installed by calling install.packages() from an R notebook.
├── analysis-data              data for the models trained in the Lab part of DSS.
├── apinode-packages           code and config related to API deployments
├── bin                        various programs and scripts to manage DSS.
├── bundle_activation_backups
├── caches                     various precomputed information (avatars, samples, etc)
├── code-envs                  definitions of all code environments, as well as the actual packages.
├── code-reports
├── config                     all user configuration and data. license.json, etc
├── data-catalog               data used for data catalog, table indices, etc
├── databases                  several internal databases used for operation of DSS
├── dss-version.json           version of DSS you’re running
├── exports                    used to generate exports (notebooks, datasets, rmarkdown, etc)
├── html-apps
├── install-support            internal files
├── install.ini                file to customize the installation of DSS
├── instance-id.txt            uid of installed DSS
├── jobs                       job logs and support files for all flow build jobs in DSS
├── jupyter-run                internal runtime support file for the Jupyter notebook. cwd resides in here for all notebooks
├── lib                        administrator-installed global custom libraries (Python and R), as well as JDBC drivers.
├── local                      administrator-installed files for serving in web applications
├── managed_datasets           location of the “filesystem_managed” connection
├── managed_folders            location of the “filesystem_folders” connection
├── notebook_results           query results for SQL / Hive / Impala notebooks
├── plugins                    plugins (both installed in DSS, and developed directly in DSS)
├── prepared_bundles           bundles
├── privtmp                    temp files, don’t modify
├── pyenv                      builtin Python environment of DSS
├── run                        all core log files of DSS
├── saved_models               data for the models trained in the Flow
├── scenarios                  scenario configs and logs
├── timelines                  databases containing timeline info of DSS objects
├── tmp                        tmp files
└── uploads                    files that have been uploaded to DSS to use as datasets.
For more info:
https://doc.dataiku.com/dss/latest/operations/datadir.html
159
Managing DSS Disk Usage
- Various subsystems of DSS consume disk space in the DSS data directory.
- Some of this disk space is automatically managed and reclaimed by DSS (like
temporary files), but some requires administrator decisions and
management.
- For example, job logs are not automatically garbage collected, because a user or
administrator may want to access them an arbitrary amount of time later.
We will cover Macros in a bit, but first let's see what other files we can delete in the
DATA_DIR.
160
Managing DSS Disk Usage
- Some logs are not rotated (Jobs and Scenarios). It is therefore crucial to clean them up
once in a while.
- In addition to those files, some other types of files can be deleted
to regain disk space:
1) Analysis data: analysis-data/ANALYSIS_ID/MLTASK_ID/
2) Saved models: saved_models/PROJECT_KEY/SAVED_MODEL_ID/versions/VERSION_ID
3) Export files: exports/
4) Temporary files (manual deletion only): tmp/
5) Caches (manual deletion only): caches/
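Before deleting anything, it helps to see which of these subdirectories actually takes up space. The following is a minimal, read-only sketch that sums file sizes under the candidate directories listed above. The helper names (dir_size_bytes, usage_report) are hypothetical, and the data directory path is whatever you passed to the installer.

```python
import os

# Subdirectories named on these slides as candidates for cleanup.
CANDIDATES = ["jobs", "scenarios", "analysis-data", "saved_models", "exports", "tmp", "caches"]


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file disappeared or is unreadable; ignore it.
                pass
    return total


def usage_report(data_dir, subdirs=CANDIDATES):
    """Return (subdirectory, bytes) pairs for existing subdirectories, largest first."""
    report = []
    for sub in subdirs:
        path = os.path.join(data_dir, sub)
        if os.path.isdir(path):
            report.append((sub, dir_size_bytes(path)))
    return sorted(report, key=lambda item: item[1], reverse=True)


# Example (the path is a placeholder):
#   for name, size in usage_report("/path/to/DATA_DIR"):
#       print(f"{name:15s} {size / 1024 ** 2:10.1f} MiB")
```

The sketch only reports sizes. What to delete remains an administrator decision, as the slide says.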
161
DSS Macros
Macros are predefined actions that allow you to automate a variety of tasks, like:
163
Backup/Disaster Recovery
• Periodic backup of DATA_DIR (contains all config/DSS state)
• Consistent live backup requires snapshots (disk-level for cloud and NAS/SAN, or OS-level with LVM)
• Industry-standard backup procedures apply
164
Dataiku Data Directory - DATA_DIR
Dataiku recommends backing up the entire data directory. If, for whatever reason,
that is not possible, the following are essential to back up:
Include in Backups:
Optional:
R.lib managed_folders
data-catalog
analysis-data managed_results
exports
bin
plugins jobs
code-envs
scenarios
config pyenv privtmp
databases saved_models
install-support
scenarios
jupyter-run
lib timelines
local uploads
managed_datasets
165
HA and Scalability
LB
Shared
(or replicated w/ sync)
File System
LB
The number of API nodes required depends on the target QPS (Queries Per Second):
● Optimized models (Java, Spark, or SQL engines; see documentation) can reach
100 to 2000 QPS
● For non-optimized models, expect 5-50 QPS per node
● If using an external RDBMS, it has to be HA itself
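The sizing rule above is a simple division rounded up. A small worked sketch, assuming you have measured (or estimated from the ranges above) the QPS a single node sustains. The function name and the spare_nodes parameter are hypothetical and not part of DSS.

```python
import math


def api_nodes_needed(target_qps, qps_per_node, spare_nodes=0):
    """Number of API nodes needed to serve target_qps.

    spare_nodes is an illustrative extra margin (for example, to survive the
    loss of one node); it is an assumption, not a Dataiku recommendation.
    """
    if target_qps <= 0 or qps_per_node <= 0:
        raise ValueError("target_qps and qps_per_node must be positive")
    return math.ceil(target_qps / qps_per_node) + spare_nodes


# A non-optimized model at the top of the 5-50 QPS range, 300 QPS target:
print(api_nodes_needed(300, 50))
# The same target at the bottom of the range:
print(api_nodes_needed(300, 5))
```

The spread between the two results shows why measuring the real per-node throughput of your model matters more than the rule itself.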
166
DSS Public API
167
The DSS Public API
The DSS public API allows you to interact with DSS from any external system. It allows you to perform a large
variety of administration and maintenance operations, in addition to access to datasets and other data
managed by DSS.
The public API Python client is preinstalled in DSS. If you plan on using it from within DSS (in a recipe,
notebook, macro, scenario, ...), you don’t need to do anything specific.
● To use the Python client from outside DSS, simply install it from pip.
○ pip install dataiku-api-client
168
The DSS Public API - Internal Use
When in DSS, you inherit the credentials of the user writing the Python code, so you don’t need an
API key. You can connect to the API in the following way:
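The code from the original slide was not preserved. A minimal sketch of the internal connection, assuming the code runs inside DSS where the dataiku package is importable. The wrapper name internal_client is hypothetical; dataiku.api_client() is the call that returns the client.

```python
def internal_client():
    """Return a public API client that uses the current DSS user's credentials."""
    # The dataiku package is only importable inside DSS (recipe, notebook,
    # scenario, ...), so it is imported lazily here.
    import dataiku

    return dataiku.api_client()


# Inside DSS:
#   client = internal_client()
#   print(client.list_project_keys())
```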
169
The DSS Public API - External Use
In contrast, when accessing DSS from the outside, you need credentials to connect, in the form of
an API key. You can define an API key in the settings of a project. You can then connect to the API
through:
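The code from the original slide was not preserved. A minimal sketch of the external connection, assuming the dataiku-api-client package is installed. The wrapper name external_client and the host and key values in the usage comment are placeholders.

```python
def external_client(host, api_key):
    """Return a public API client authenticated with an API key."""
    # Installed with: pip install dataiku-api-client
    import dataikuapi

    return dataikuapi.DSSClient(host, api_key)


# From outside DSS (placeholder values):
#   client = external_client("http://<host>:<port>", "<key>")
#   print(client.list_project_keys())
```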
170
The DSS Public API - Generating API Keys
There are three kinds of API keys for the DSS REST API:
● Project-level API keys: privileges on the content of the project only. They cannot give access to
anything which is not in their project. http://YOUR_INSTANCE/projects/YOUR_PROJECT/security/api
● Global API keys: encompass several projects. Global API keys can only be created and modified by DSS
administrators. http://YOUR_INSTANCE/admin/security/apikeys/
● Personal API keys : created by each user independently. They can be listed and deleted by admin, but
can only be created by the end user. A personal API key gives exactly the same permissions as the user who
created it. http://YOUR_INSTANCE/profile/apikeys/
171
DSS Public API - Generating Global API Keys
173
The DSS Public API - Python Examples
The Public API can help you interact with several parts of DSS:
✓ Managing users:
✓List users:
✓Create user:
✓Drop user:
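The snippets from the original slide were not preserved. A sketch of the three user operations, assuming client is a dataikuapi DSSClient connected with admin rights. The helper names and the login, password and group values are hypothetical.

```python
def list_logins(client):
    """List users: return the login of every user known to DSS."""
    return [user["login"] for user in client.list_users()]


def create_local_user(client, login, password, display_name, groups):
    """Create user: add a LOCAL user with the DATA_SCIENTIST profile."""
    return client.create_user(
        login,
        password,
        display_name=display_name,
        source_type="LOCAL",
        groups=groups,
        profile="DATA_SCIENTIST",
    )


def drop_user(client, login):
    """Drop user: delete the user with the given login."""
    client.get_user(login).delete()


# Usage (placeholder values):
#   create_local_user(client, "user_x", "a-password", "USER_X", ["GROUP_X"])
#   print(list_logins(client))
#   drop_user(client, "user_x")
```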
174
The DSS Public API - Python Examples
The Public API can help you interact with several parts of DSS:
✓ Managing groups:
✓List groups:
✓Create group:
✓Drop group:
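The snippets from the original slide were not preserved. A sketch of the three group operations, assuming client is a dataikuapi DSSClient connected with admin rights. The helper names and the group name are hypothetical.

```python
def list_group_names(client):
    """List groups: return the name of every group."""
    return [group["name"] for group in client.list_groups()]


def create_local_group(client, name, description=None):
    """Create group: add a LOCAL group."""
    return client.create_group(name, description=description, source_type="LOCAL")


def drop_group(client, name):
    """Drop group: delete the group with the given name."""
    client.get_group(name).delete()


# Usage (placeholder values):
#   create_local_group(client, "GROUP_X", "Example group")
#   print(list_group_names(client))
#   drop_group(client, "GROUP_X")
```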
175
The DSS Public API - Python Examples
The Public API can help you interact with several parts of DSS:
✓ Managing connections:
✓List connections:
✓Create connection:
✓Drop connection:
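The snippets from the original slide were not preserved. A sketch of the three connection operations, assuming client is a dataikuapi DSSClient connected with admin rights and that list_connections() returns a dict keyed by connection name (its default form in the client versions I am aware of). The helper names are hypothetical, and the connection type and params in the usage comment are illustrative only: check the documentation for the parameters each connection type expects.

```python
def list_connection_names(client):
    """List connections: return the connection names, sorted."""
    # Assumption: list_connections() returns a dict of name -> definition.
    return sorted(client.list_connections())


def create_connection(client, name, conn_type, params):
    """Create connection: conn_type and params depend on the target system."""
    return client.create_connection(name, conn_type, params, usable_by="ALL")


def drop_connection(client, name):
    """Drop connection: delete the connection with the given name."""
    client.get_connection(name).delete()


# Usage (illustrative values, not a verified parameter set):
#   create_connection(client, "my_files", "Filesystem", {"root": "/data/my_files"})
#   print(list_connection_names(client))
#   drop_connection(client, "my_files")
```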
176
The DSS Public API - Python Examples
The Public API can help you interact with several parts of DSS:
✓ Managing projects:
✓Create new project:
✓Handle permissions
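The snippets from the original slide were not preserved. A sketch of project creation and a permission change, assuming client is a dataikuapi DSSClient with the right to create projects. The helper names, project key, owner and group are hypothetical; the permission entry shown grants read access to the project content only.

```python
def create_project(client, project_key, name, owner):
    """Create new project: returns a handle on the created project."""
    return client.create_project(project_key, name, owner)


def grant_read_access(project, group):
    """Handle permissions: let a group read the project content."""
    permissions = project.get_permissions()
    permissions["permissions"].append({"group": group, "readProjectContent": True})
    project.set_permissions(permissions)
    return permissions


# Usage (placeholder values):
#   project = create_project(client, "MYPROJECT", "My project", "user_x")
#   grant_read_access(project, "GROUP_X")
```

Reading the current permissions, modifying them and writing them back keeps the entries that were already set on the project.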
177
HTTP REST API Example
import json

import requests

# Create a user through the HTTP REST API
HOST = "http://<host>:<port>/public/api/admin/users/"
API_KEY = "<key>"
HEADERS = {"Content-Type": "application/json"}
DATA = {
    "login": "user_x",
    "sourceType": "LOCAL",
    "displayName": "USER_X",
    "groups": ["GROUP_X"],
    "userProfile": "DATA_SCIENTIST",
}
# The API key is passed as the basic-auth user name, with an empty password
r = requests.post(url=HOST, auth=(API_KEY, ""), headers=HEADERS, data=json.dumps(DATA))
178
Dataiku Command Line Tool - dsscli
dsscli is a command-line tool that can perform a variety of runtime administration tasks on DSS. It can be
used directly by a DSS administrator, or incorporated into automation scripts.
dsscli provides a large number of commands. Each command performs a single administration task.
For example, to list the job history of project MYPROJECT, use: ./bin/dsscli jobs-list MYPROJECT
179
Time for the Lab!
180
Module 7
181
CGroups in DSS
182
DSS 5.0 brings some new solutions for resource management
● Resource control : full integration with the Linux cgroups functionality in order to restrict resource usage
per project, user, category, … and protect DSS against memory overruns
● Docker : Python, R and in-memory Visual ML recipes can be run in Docker containers :
○ Ability to push computation to a specific remote host
○ Ability to leverage hosts with different computing capabilities, like GPUs
○ Ability to restrict the resources used (CPU, memory, …) per container
○ But no global resource control, and the user has to decide on which host to run (no magic distribution)
● Kubernetes :
○ Native ability to run on a cluster of machines. Kubernetes automatically places containers on machines depending on resources availability.
○ Ability to globally control resource usage.
○ Managed cloud Kubernetes services can have auto-scaling capabilities.
● This feature allows control over usage of memory, CPU (+ other resources) by most processes.
● The cgroups integration in DSS is very flexible and allows you to devise multiple resource allocation
strategies:
● Limiting resources for all processes from all users
● Limiting resources by process type (i.e. a resource limit for notebooks, another one for webapps, …)
● Limiting resources by user
● Limiting resources by project key
● cgroups enabled on the Linux DSS server (this is the default on all recent DSS-supported
distributions)
● DSS service account needs to have write access to one or several cgroups
● This normally requires some action to be performed at system boot before DSS startup, and can be
handled by the DSS-provided service startup script
● This feature works with both regular and multi user security
● The applicable limits are the ones made available by Linux cgroups (check the Linux documentation for more
information)
○ memory.limit_in_bytes : sets the maximum amount of user memory (including file cache). If no units are specified, the
value is interpreted as bytes. However, it is possible to use suffixes to represent larger units — k or K for kilobytes, m or
M for megabytes, and g or G for gigabytes
○ cpu.cfs_quota_us and cpu.cfs_period_us : cpu.cfs_quota_us specifies the total amount of time in microseconds for
which all tasks in a cgroup can run during one period as defined by cpu.cfs_period_us.
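A short worked sketch of how these two kinds of limits read in practice: converting a memory limit with a unit suffix to bytes, and turning a quota/period pair into an equivalent number of CPUs. The helper names are hypothetical. The suffixes are treated as powers of 1024, and a quota of -1 as "no limit", which is how Linux cgroups v1 interprets them.

```python
_SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_memory_limit(value):
    """Convert a memory.limit_in_bytes value such as '4g' or '512M' to bytes."""
    text = str(value).strip()
    suffix = text[-1:].lower()
    if suffix in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[suffix]
    # No suffix: the value is already in bytes.
    return int(text)


def cpus_allowed(cfs_quota_us, cfs_period_us):
    """Equivalent number of CPUs for a quota/period pair, or None if unlimited."""
    if cfs_quota_us < 0:
        return None
    return cfs_quota_us / cfs_period_us


print(parse_memory_limit("4g"))      # bytes in 4 GiB
print(cpus_allowed(200000, 100000))  # tasks may use two CPUs' worth of time per period
```

For example, a quota of 50000 with a period of 100000 limits the cgroup to half a CPU.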
©2018 dataiku, Inc.
189
Using cgroup for resource control
Server side setup preparation
● On most Linux distributions, the “cpu” and “memory” controllers are mounted in different hierarchies, generally :
○ /sys/fs/cgroup/memory
○ /sys/fs/cgroup/cpu
● You will first need to make sure that you have write access to a cgroup within each of these
hierarchies.
● To avoid conflicts with other parts of the system which manage cgroups, it is advised to configure
dedicated subdirectories within the cgroup hierarchies for DSS, e.g.
○ /sys/fs/cgroup/memory/DSS
○ /sys/fs/cgroup/cpu/DSS
● Note that these directories will not persist over a reboot. You can modify the DSS startup script
(/etc/sysconfig/dataiku[.INSTANCE_NAME]) to create these.
○ DIP_CGROUPS and DIP_CGROUP_ROOT
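A minimal sketch for checking the write-access prerequisite, meant to be run as the DSS service account. The helper name is hypothetical, and the directories in the usage comment are the example paths from this slide; adapt them to your system.

```python
import os


def cgroup_access_problems(cgroup_dirs):
    """Return (path, reason) pairs for directories that are missing or not writable."""
    problems = []
    for path in cgroup_dirs:
        if not os.path.isdir(path):
            problems.append((path, "missing"))
        elif not os.access(path, os.W_OK | os.X_OK):
            # Creating child cgroups needs write and search permission on the directory.
            problems.append((path, "not writable"))
    return problems


# Usage, as the DSS service account (example paths from the slide):
#   print(cgroup_access_problems(["/sys/fs/cgroup/memory/DSS", "/sys/fs/cgroup/cpu/DSS"]))
```

An empty list means both directories exist and are writable by the account running the check. Because the directories do not persist over a reboot, it is worth re-running after a restart.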
191
JVM Memory Model
➢ You need to tell Java how much memory it can allocate
➢ -Xms => Minimum amount of memory allocated for the heap
(Your java process will never consume less memory than this limit + a fixed
overhead)
➢ -Xmx => Maximum amount of memory allocated for the heap
(Your java process will never consume more memory than this limit + a fixed
overhead)
➢ Java allocates memory when it needs it…and deallocates memory that it has not used for
a while.
○ For that, Java uses a Garbage Collector, which periodically scans the Java
program to find unused memory blocks and reclaim them.
➢ If your program requires more memory than the authorized maximum (Xmx), the
program will throw an OutOfMemoryError...but before that, the Garbage
Collector will do its best to find the memory your program is asking for
○ More often than not, the Java process seems stuck before it throws an
OutOfMemoryError because all CPU cycles of the Java process are
burned by the GC (which tries to find memory for you) rather than by the actual
program.
192
Java Memory Settings
If you experience OOM issues, you may want to modify the memory settings in the DATA_DIR/install.ini file:
● Stop DSS
● [javaopts]
● backend.xmx = Xg
○ Default of 2g, global
○ For large production instances, may need to be as high as 20g
○ Look for “OutOfMemoryError: Java heap space” or “OutOfMemoryError: GC overhead limit
exceeded” before “DSS Startup: backend version” in backend.log
● jek.xmx = Xg
○ Default of 2g, multiplied by the number of JEKs
○ Increase incrementally by 1g
○ Look for “OutOfMemoryError: Java heap space” or “OutOfMemoryError: GC overhead limit
exceeded” in the job log
● fek.xmx = Xg
○ Default of 2g, multiplied by the number of FEKs
○ Increase incrementally by 1g
● Restart DSS
● Note: You should typically only increase these per the instructions of Dataiku.
193
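A minimal sketch of the edit step, using Python's standard configparser to set a heap size in the [javaopts] section of install.ini. The helper name is hypothetical, and the key names are an assumption that follows the `<process>.xmx` pattern shown for jek and fek; confirm them against the DSS documentation before use. Rewriting the file this way drops any comments it contains, so keep a copy first, and stop DSS before editing as the slide says.

```python
import configparser
import re

PROCESSES = {"backend", "jek", "fek"}


def set_java_heap(install_ini_path, process, heap_size):
    """Set <process>.xmx in the [javaopts] section of install.ini."""
    if process not in PROCESSES:
        raise ValueError(f"unknown process: {process}")
    if not re.fullmatch(r"\d+[mg]", heap_size):
        raise ValueError("heap_size must look like '4g' or '2048m'")

    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read(install_ini_path)
    if not parser.has_section("javaopts"):
        parser.add_section("javaopts")
    parser.set("javaopts", f"{process}.xmx", heap_size)
    with open(install_ini_path, "w") as handle:
        parser.write(handle)


# Usage (placeholder path), with DSS stopped:
#   set_java_heap("/path/to/DATA_DIR/install.ini", "jek", "3g")
```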
Other Processes
● Spark drivers
○ Configure Spark driver memory: spark.driver.memory
○ Or use cgroups
● Notebooks
○ Unload notebooks
○ Admins can force shutdown
○ Use cgroups
○ Or run them in k8s
● In-memory ML
○ Use cgroups
● Webapps
○ Use cgroups
194
Time for the Lab!
195
The End!