Our Guide to Open Source in Data Science


The Open Source Revolution

Over the past few decades, digital technologies have completely transformed our way of life. From how we communicate to the way we conduct business, software has disrupted how value is generated today. This third industrial revolution is where information technology and digital solutions have become widely used to automate production and improve productivity (World Economic Forum).

Arguably, the most significant catalyst to the adoption and development of digital technologies is open source software, which has led to many of the most exciting innovations of the 21st century (zdnet). At a simple level, open-source software is a type of software where the source code is released with flexible licensing so that it can be accessed, used, distributed, and modified by other developers. As such, open source software has introduced a paradigm shift: organizations can now build secure, high-quality software that gives them more flexibility while hiring and retaining the best talent there is. The advantages of open-source software are many, which is why now more than ever, organizations across all industries are adopting open-source technologies (Red Hat).

Security

While open-source software is often dismissed as being insecure, the truth is that open-source software is among the most secure on the market. An open community-based approach incentivizes hundreds, if not thousands, of developers to monitor for vulnerabilities in a program (Bluespark). This means that vulnerabilities and security flaws are found and fixed much faster than in proprietary software (PCWorld). Moreover, open-source software allows organizations to perform security audits on any software they plan to onboard in their processes, as opposed to “black box” proprietary software (inc).

Quality

According to Red Hat, the leading reason organizations adopt open-source software is that it is considered to be higher quality than proprietary solutions. Its quality is driven by the nature of open source collaboration, where practitioners have input in how the tools they use are designed. This “built for the people by the people” approach encourages better alignment between open source software designers and their end users. Moreover, this means that teams across the organization can adopt, build, and customize software with the same open-source tools. This helps teams avoid falling into the trap of silos where teams use different sets of proprietary software. Finally, the free and open nature of open-source software means that practitioners have a faster speed to value as they can easily repurpose existing code-bases for their specific needs (Xorlogics).



Cost

One of the most obvious benefits of open source adoption is a reduced total cost. Open-source software streamlines costs for organizations across the software adoption flow. For starters, organizations can quickly adopt a solution and experiment with it, avoiding time spent on vendor-led proofs of concept and requests for proposals. Moreover, open-source software minimizes licensing and maintenance costs, as open-source libraries are upgraded for free. Most importantly, open-source software is extensible and customizable, whereas proprietary software locks organizations in with vendors even when it ceases to meet the demands of their use cases (O’Reilly).

Flexibility

One of the key differentiators between proprietary and open-source software is flexibility and customization. Ultimately, proprietary software is controlled and managed by its developers, whereas open source software has much more flexible licensing. This enables organizations to customize software for their workflows and provides them more control over the tools and solutions they develop (inc). Moreover, open-source software is interoperable, meaning that it can work with a variety of data formats, and is designed for cloud and cloud-native technologies. Finally, open source software enables organizations to avoid vendor lock-in and allows them to test and try software before committing to a solution (InfoWorld).

Talent and Skills

Arguably one of the most important aspects of open-source software is how it intersects with talent acquisition and retention. In short, organizations that contribute to and work with open source projects attract and retain better talent. For starters, open source tools and programming languages have become the standard in academia and industry alike, facilitating skill sharing and development.

“It means we build better software, write better code, our engineers are able to work with more pride, and we’re able to retain the world’s best engineers because they know they can open source their work.”

James Pearce
Engineering Director at Facebook (VentureBeat)

Most importantly, open source contribution enables organizations to attract and hire the best talent. According to Wipro, 80% of organizations that work with and contribute to open source projects got into it specifically to attract and retain talent. It’s no surprise then that 86% of information technology leaders believe that the most innovative companies are adopting and investing in open source software today (Red Hat).



Open-Source Software in Data Science

Now, we stand on the cusp of a fourth industrial revolution (Salesforce) that will be defined by the intersection of data-driven and data-generating technologies such as artificial intelligence, machine learning, the internet of things, and more. Open-source technologies will continue to empower organizations to make the most of their data and create transformative solutions, processes, and products with machine learning and data science.

The two most commonly used open source programming languages in data science are R and Python. There’s a lot of discussion on the difference between the two languages as they provide thousands of open source data science and machine learning packages. At DataCamp, we’ve built our entire data science curriculum around empowering people and organizations to become data fluent by teaching the most popular and powerful open source frameworks for both languages.

While not every member of the organization is required to learn Python or R to become data-driven (The Data Leader’s Guide to Upskilling), these technologies underpin the growth and adoption of data science in an organization. This guide will demystify the most popular data science and machine learning packages and tools in R and Python and uncover their use cases throughout an organization.



Data Manipulation

Python

pandas: pandas is one of the most popular Python packages in data science for working with tabular data due to its ease of use, ability to work with large quantities of data, built-in plotting and aggregation tools, and more. It supports reading and writing a variety of data formats, from CSV and Excel files to SQL and more.

geopandas: GeoPandas is built on top of pandas and extends pandas capabilities to easily work with, process, manipulate, and visualize geospatial data in Python.

numpy: NumPy is one of the most elemental packages in Python, as many other packages are built on top of it, including pandas and SciPy. It allows the formation, transformation, and manipulation of arrays, among other operations.

scipy: SciPy stands for scientific Python, and contains a set of scientific tools and techniques for statistics, linear algebra, data processing, and more.

R

dplyr, tidyr, readr, tibble: Considered the core packages of the tidyverse for data manipulation, these open source packages offer a host of tools and functions to read, manipulate, and tidy data. The readr package allows practitioners to read in a variety of data formats, whereas the tidyr, dplyr, and tibble packages offer a suite of tools to manipulate, clean, tidy, and work with data efficiently.

data.table: The data.table package is used for working with tabular data in R and is widely known for its speed of execution on larger datasets and its intuitive syntax.

xts: xts is one of the most popular packages for working with time series data in R. It provides a host of functions for working with time series data, such as indexing, resampling, handling missing data, and more.

Our content

Courses (Python): Introduction to Python; Pandas Foundations; Manipulating DataFrames with pandas; Working with Geospatial Data in Python; Analyzing US Census Data in Python; Exploratory Data Analysis in Python
Courses (R): Introduction to the Tidyverse; Data Manipulation with dplyr; Exploratory Data Analysis in R; Manipulating Time Series Data with xts and zoo in R; Data Manipulation with data.table in R
Tracks (Python): Data Manipulation with Python (4 courses)
Tracks (R): Data Manipulation with R (5 courses)

Use Cases: Automate legacy Excel workflows; Conduct time-series analysis on sales data; Analyze traffic rates for city planning; Conduct Covid-19 contact tracing analysis; Optimize business processes with various constraints
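
As a rough illustration of the pandas workflow described in this section (reading tabular data, then aggregating it), the sketch below loads a small in-memory CSV and totals sales per region. The column names and figures are made up for the example; pd.read_csv and groupby are standard pandas calls, and a real workflow would read a file on disk instead.

```python
import io

import pandas as pd

# Hypothetical sales records; in practice this would be a CSV or Excel file on disk.
raw_csv = io.StringIO(
    "region,month,sales\n"
    "North,2021-01,120\n"
    "North,2021-02,135\n"
    "South,2021-01,80\n"
    "South,2021-02,95\n"
)

# Read the tabular data into a DataFrame.
sales = pd.read_csv(raw_csv)

# Aggregate total sales per region.
totals = sales.groupby("region")["sales"].sum()

print(totals)
```

The same two steps (load, then group and aggregate) cover a large share of the spreadsheet-style workflows listed in the use cases.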



Data Visualization

Python

matplotlib: Matplotlib is the most popular data visualization package in Python, enabling comprehensive creation and customization of different types of data visualizations.

seaborn: Seaborn is a data visualization package built on top of Matplotlib that allows for the easy creation of highly aesthetic plots in Python.

bokeh, plotly: Bokeh and Plotly are interactive visualization libraries that allow for the creation and customization of interactive plots and widgets that can be published in web pages.

folium: Folium is built on top of JavaScript's Leaflet package, which provides the ability to easily visualize geospatial data in Python with robust styling capabilities.

R

ggplot2: The most popular data visualization package for R, this tidyverse package allows the creation and customization of a range of data visualizations. It also offers a range of extensions to visualize unique data structures like network data, quickly develop themes, animate plots, and more.

leaflet: Originally a JavaScript package, the Leaflet package provides the ability to easily visualize geospatial data in R with robust styling capabilities.

rbokeh, plotly: Rbokeh and Plotly are interactive visualization libraries that allow for the creation and customization of interactive plots and widgets that can be published in web pages.

Our content

Courses (Python): Introduction to Data Visualization with Matplotlib; Introduction to Data Visualization with Seaborn; Interactive Data Visualization with Bokeh; Visualizing Time Series Data in Python; Visualizing Geospatial Data in Python; Introduction to Data Visualization with Plotly in Python
Courses (R): Introduction to Data Visualization with ggplot2; Interactive Data Visualization with rbokeh; Interactive Data Visualization with plotly in R; Interactive Maps with leaflet in R
Tracks (Python): Data Visualization with Python (5 courses)
Tracks (R): Data Visualization with R (3 courses)

Use Cases: Visually compare multiple columns using subplots; Create presentation-ready plots with three lines of code; Build free interactive dashboards, hosted on web pages, to track key performance indicators; Visualize Covid-19 cases across the world



Data Cleaning

Python

recordlinkage: Built on top of pandas, the Record Linkage library allows the linking and merging of two or more data sources. It helps to match and deduplicate records that are believed to be the same entity.

missingno: Missingno allows the quick visualization and inspection of missing data, enabling data scientists to determine the root cause of missingness.

R

reclin: reclin is an R library that allows the linking and merging of two or more data sources. It helps to match and deduplicate records that are believed to be the same entity.

forcats: The forcats package is a tidyverse package that enables practitioners to quickly solve common problems when working with categorical data, such as reordering and collapsing categories.

naniar, VIM: The naniar and VIM packages allow the quick visualization and inspection of missing data, enabling data scientists to determine the root cause of missingness.

Our content

Courses (Python): Cleaning Data in Python; Dealing with Missing Data in Python
Courses (R): Cleaning Data in R; Dealing with Missing Data in R; Handling Missing Data with Imputations in R; Working with Data in the Tidyverse
Tracks (Python): Importing and Cleaning Data with Python (5 courses)
Tracks (R): Importing and Cleaning Data with R (4 courses)

Use Cases: Consolidate and deduplicate disparate organizational data and establish trust in data quality; Determine the root cause of missing data in a database; Clean the results of a survey



Probability and Statistics

Python

pymc3: PyMC3 is one of the most popular Python packages for probabilistic programming. It provides a host of tools to work with probabilistic programming in Python, including modeling, simulation, transformations, and more.

statsmodels: Statsmodels is a Python library that provides a host of statistical functions and capabilities, including regression models, time series analysis, experiment design, and more.

arch: The arch package contains a set of functions for forecasting highly volatile time series data. Often used in finance, the arch package enables practitioners to model, evaluate, and work with GARCH models in Python, which are popular for forecasting volatile time series data.

R

MASS: MASS is an R library that provides a host of datasets and functionalities for statistical analysis, including regression models, statistical tests, and more.

stats: stats is an R package that provides a comprehensive set of functions and capabilities, including regression models, plotting functionality, time series analysis, experiment design, and more.

fable: Part of the tidyverts set of packages for time series forecasting, this package offers a range of tools and functions for easily performing and evaluating common time series forecasting models.

powerMediation: The powerMediation package provides a robust set of tools for designing, running, and evaluating statistical experiments in R.

Our content

Courses (Python): Introduction to Regression with statsmodels in Python; Statistical Thinking in Python; ARIMA Models in Python; GARCH Models in Python; Customer Analytics and A/B Testing in Python
Courses (R): Introduction to Statistical Modeling in R; Correlation and Regression in R; Foundations of Inference; Foundations of Probability in R; Experimental Design in R; Survival Analysis in R
Tracks (Python): Time Series with Python (5 courses); Statistics Fundamentals with Python (5 courses)
Tracks (R): Time Series with R (6 courses); Statistics Fundamentals with R (5 courses)

Use Cases: Determine the best performing webpage enhancement with an A/B test; Forecast demand for supply chain planning; Measure the volatility of a stock portfolio; Evaluate the results of a clinical trial in pharmaceuticals
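
To make the A/B testing use case concrete, here is a minimal sketch of a two-proportion z-test written with only the Python standard library, so that it stays self-contained. Packages such as statsmodels offer ready-made tests for this; the function name and the visitor counts below are hypothetical and chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: variant A converts 200 of 5000 visitors, variant B 260 of 5000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference between the two page variants is unlikely to be due to chance alone, assuming visitors were assigned at random and the samples are large enough for the normal approximation.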



Machine Learning

Python

scikit-learn: scikit-learn is the most popular and versatile machine learning framework across any programming language. It includes a host of tools, functions, and techniques, covering the entirety of the machine learning pipeline from exploratory analysis, to data pre-processing, modeling and training, and accuracy evaluation.

catboost, lightgbm, xgboost: CatBoost, LightGBM, and XGBoost are machine learning libraries for gradient boosting on decision trees. While they differ in training speed, how they handle missing values, feature importance methods, and other key technical components, gradient boosted trees have become widely popular for machine learning on tabular data.

tensorflow: TensorFlow is an end-to-end deep learning framework developed by Google that provides a comprehensive set of tools for deep learning model building, evaluation, and deployment.

keras: Keras is a library built on top of TensorFlow, meant to reduce the barrier to working with deep learning by simplifying model building, evaluation, and deployment.

pytorch: PyTorch is another popular deep learning framework developed by Facebook. It is widely used in research and provides a comprehensive toolset for deploying deep learning models in production.

R

tidymodels: Similar to the tidyverse, tidymodels is a collection of R packages designed for the machine learning workflow. It contains a range of packages such as rsample for better data splitting and sampling, parsnip for efficient modelling, tune for hyperparameter tuning, and more.

caret: Caret is one of the most popular machine learning packages in R. It includes a host of tools, functions, and techniques that cover the entirety of the machine learning pipeline from exploratory analysis, to data pre-processing, modeling and training, and accuracy evaluation.

xgboost: Just like its Python counterpart, XGBoost is a machine learning library for gradient boosting on decision trees, a widely popular technique for machine learning on tabular data.

Metrics: Metrics is one of the most popular R packages for evaluating the performance of a range of machine learning predictions, from classification to regression, time series forecasting, and more.

rpart: rpart is a popular package for working with tree-based models in R. It allows predicting categorical and continuous outcomes by developing easy-to-visualize decision rules.

Our content

Courses (Python): Supervised Learning with scikit-learn; Unsupervised Learning in Python; Cluster Analysis in Python; Introduction to TensorFlow in Python; Extreme Gradient Boosting with XGBoost; Introduction to Deep Learning with Keras; Introduction to Deep Learning with PyTorch; Machine Learning for Marketing in Python
Courses (R): Supervised Learning in R: Classification; Supervised Learning in R: Regression; Unsupervised Learning in R; Machine Learning with caret in R; Machine Learning for Marketing Analytics in R; Tree-based models in R
Tracks (Python): Machine Learning Fundamentals with Python (5 courses); Machine Learning Scientist with Python (23 courses)
Tracks (R): Machine Learning Fundamentals in R (4 courses); Machine Learning Scientist with R (15 courses)

Use Cases: Predict customer churn with classification models; Predict housing prices with regression models; Detect customer segments with unsupervised learning; Develop an image recognition system to digitize documents

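
The pipeline that the scikit-learn description refers to (pre-processing, training, and accuracy evaluation) can be sketched in a few lines. This is an illustrative example on scikit-learn's bundled iris dataset, not a recommended modeling setup; the choice of logistic regression, the 25% test split, and the random seed are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small bundled classification dataset (150 flowers, 3 classes).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows for evaluation; the seed is fixed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain pre-processing and the classifier so both are fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out rows.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline keeps the evaluation honest, since the scaling parameters never see the test rows.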
Natural Language Processing

Python

gensim: Gensim is a fast and efficient Python library for topic modeling, document comparison, topic identification, and more on large text datasets.

spacy: spaCy is an open source library for Natural Language Processing that performs a range of NLP tasks, from tokenization and part-of-speech tagging to lemmatization, text classification, and more.

nltk: NLTK is an open source Python library that provides a host of NLP tools for data preprocessing, classification, parsing text, sentiment analysis, and more.

R

tidytext: tidytext provides a suite of NLP functions to make text mining tasks easier, more effective, and consistent with the tidyverse toolset. It allows practitioners to efficiently perform tasks like tokenization, sentiment analysis, stopword removal, and more.

topicmodels: The topicmodels package provides a host of topic modeling functions aimed at identifying and summarizing text and categorizing documents.

stringr: Stringr is one of the most popular packages in R for working with text data. Part of the tidyverse packages, it allows a host of operations on text data such as string detection, string subsetting, joining and splitting strings, and more.

Our content

Courses (Python): Introduction to Natural Language Processing in Python; Sentiment Analysis in Python; Advanced NLP with spaCy; Feature Engineering for NLP in Python; Machine Translation in Python
Courses (R): Introduction to Natural Language Processing in R; Introduction to Text Analysis in R; Intermediate Regular Expressions in R; String Manipulation with stringr in R; Text Mining with Bag of Words in R
Tracks (Python): Natural Language Processing in Python (6 courses)
Tracks (R): Text Mining with R (4 courses)

Use Cases: Categorize documents based on topic; Pre-process text data for deep learning models; Perform sentiment analysis on customer tweets
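
Tokenization and stop-word removal, two of the pre-processing tasks named in this section, can be sketched with the Python standard library alone. Libraries such as NLTK and spaCy provide far more robust tokenizers and curated stop-word lists; the tiny stop-word set, the helper names, and the sample review below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real libraries ship curated ones.
STOP_WORDS = {"the", "is", "a", "and", "to", "of", "it", "this", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count tokens after dropping stop words."""
    return Counter(tok for tok in tokenize(text) if tok not in STOP_WORDS)


review = "The delivery was fast and the support team is great. Great service!"
counts = bag_of_words(review)
print(counts.most_common(3))
```

The resulting word counts are the kind of simple feature that topic models and sentiment classifiers can build on.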



Application Specific Packages

Python

networkx: NetworkX is one of the most popular Python packages for creating, manipulating, and studying network structures.

tweepy: Tweepy is an easy-to-use Python library for accessing and manipulating Twitter data.

pypfopt: PyPortfolioOpt is a popular Python package for portfolio analysis, optimization, and quantitative risk management.

skimage: Scikit-image is an open source library containing a collection of image processing algorithms such as feature detection, filtering, segmentation, and more.

opencv: OpenCV is one of the most popular computer vision libraries in Python, containing a wide range of tools for working with and processing image data.

R

igraph: igraph is one of the most popular packages for creating, manipulating, visualizing, and studying network structures in R.

rtweet: rtweet is an easy-to-use R library for accessing and manipulating Twitter data.

qrm: QRM is a popular package for portfolio analysis, optimization, and quantitative risk management in R.

magick: Built on top of ImageMagick STL, a popular open source library for working with image data, magick provides a comprehensive set of functionalities to work with and process image data in R.

Our content

Courses (Python): Introduction to Network Analysis in Python; Intermediate Network Analysis in Python; Analyzing Social Media Data in Python; Image Processing in Python; Quantitative Risk Management in Python
Courses (R): Network Analysis in R; Case Studies: Network Analysis in R; Network Analysis in the Tidyverse; Quantitative Risk Management in R; Analyzing Social Media Data in R
Tracks (Python): Image Processing with Python (3 courses); Applied Finance in Python (4 courses)
Tracks (R): Marketing Analytics with R (6 courses); Applied Finance in R (7 courses)

Use Cases: Optimize supply chain flows with network analytics; Analyze the popularity of a service in a given geographical location; Automatically optimize a stock portfolio; Perform optical character recognition for document digitization



Reporting and Communicating Data

Python

dash: Dash is a highly robust framework for building rich, interactive, and customizable data visualization apps that can be rendered and shared easily on a web browser.

streamlit: Streamlit is another highly popular framework for quickly building and sharing data apps. While it is highly useful for sharing data insights on a variety of use cases, it is especially used for sharing machine learning model results and analysis.

jupyter notebooks: Arguably the most popular tool in Python for data science, Jupyter Notebooks are the IDE of choice for 74% of data scientists (Kaggle). Jupyter Notebooks are an open source web application that allows creating and sharing documents containing live code, visualizations, and narrative text. They've completely revolutionized how data scientists share their work, and will continue to lower the barrier for data democratization (DataCamp).

R

R Markdown: One of the most popular tools in the R data science stack, the R Markdown Notebook is similar to a Jupyter Notebook in Python. It allows practitioners to analyze, describe, share, and reproduce their analysis in a friendly notebook interface.

shiny: Shiny is one of the most popular packages in data science and in the R data science stack. It provides the ability to create highly robust dashboards and web apps that can be rendered and easily shared on a web browser.

shinydashboard: shinydashboard is a library built on top of shiny that makes it easy to develop data visualization dashboards with shiny.

flexdashboard: flexdashboard is an open source R library that makes it easy to develop dashboards with RMarkdown.

Our content

Courses (Python): Building Dashboards with Dash and plotly (coming soon)
Projects (Python): Comparing Search Interest with Google Trends; Exploring the evolution of lego; Bad passwords and the NIST guidelines; Analyzing TV Data
Courses (R): Building Web Applications with Shiny in R; Building Dashboards with shinydashboard; Case Studies: Building Web Applications with Shiny in R; Reporting with R Markdown; Building Dashboards with flexdashboard
Tracks (R): Shiny Fundamentals (4 courses)

Use Cases: Live tracking of team or company OKRs with a web-based dashboard; Sharing machine learning experiment results with business stakeholders; Automating legacy Excel workflows; Posting and sharing data analysis results to business stakeholders; Onboarding new hires on data processes with notebook tutorials



Big Data

Python

pyspark: Apache Spark is an open-source distributed data processing framework that can perform data processing tasks on very large datasets. PySpark provides a Python API for working with Spark.

dask: Dask is an open-source library for parallel computing in Python. It scales familiar NumPy and pandas workflows to larger-than-memory datasets and across multiple cores or machines.

R

fst: fst provides a fast and flexible way to serialize data frames. It allows for faster read and write times, and enables practitioners to work more quickly with big data in R.

sparklyr: Spark is an open-source distributed data processing framework that can perform data processing tasks on very large datasets. sparklyr provides an R API for working with Spark.

Our content

Courses (Python): Introduction to PySpark; Cleaning Data in PySpark; Machine Learning with PySpark; Building Recommendation Engines with PySpark; Parallel Programming with Dask in Python
Courses (R): Introduction to Spark with sparklyr in R; Scalable Data Processing in R; Parallel Programming in R
Tracks (Python): Big Data with PySpark (6 courses)
Tracks (R): Big Data with R (5 courses)

Use Cases: Perform market basket analysis on millions of customer e-commerce transactions; Quickly analyze millions of Covid-19 infections; Develop a recommendation engine on large movie streaming datasets



Data Engineering

Python

airflow: Developed by Airbnb, Apache Airflow is an open-source tool for data workflow automation. It is highly scalable and extensible, and works well with a variety of common tools like cloud providers, databases, Salesforce, and more.

sqlalchemy: SQLAlchemy is a comprehensive SQL toolkit for Python that enables mapping SQL tables to user-defined Python objects, making it easy to create tables, map relations between them, and ingest data all through Python.

sqlite3: SQLite3 provides a SQL interface in Python that allows practitioners to connect to a SQL database and execute SQL code within Python.

R

jsonlite, xml2: While Python is better known for data engineering, jsonlite and xml2 are R packages that provide a host of tools for working with, processing, and transforming JSON and XML files in R. They allow practitioners to easily work with web data, and are optimized for building pipelines with R.

odbc: The odbc package provides a wide range of functionality for connecting to and working with databases in R. It supports various types of databases, including MySQL, PostgreSQL, SQL Server, SQLite, BigQuery, Redshift, and more.

dbi: DBI provides a SQL interface in R that allows practitioners to connect to a SQL database and execute SQL code within R. There are many packages built on top of DBI that make it even easier to connect to databases in R, such as RPostgreSQL, RMySQL, and ROracle.

Our content

Courses (Python): Introduction to Data Engineering; Introduction to Airflow in Python; Introduction to Databases in Python; Streamlined Data Ingestion with pandas; Introduction to Importing Data in Python
Courses (R): Intermediate Importing Data in R; Working with Web Data in R; Introduction to Relational Databases in SQL
Tracks (Python): Data Engineer with Python (25 courses)
Tracks (R): R Programmer (12 courses)

Use Cases: Scheduling a daily data analysis workflow; Extract, transform, and load data into a database; Scrape web pages and load their contents into a database
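
The sqlite3 workflow described in this section (connect to a database, then execute SQL from Python) might look like the sketch below. It uses the sqlite3 module from the Python standard library with an in-memory database; the orders table and its rows are hypothetical stand-ins for data produced by an extract-and-transform step.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; pass a file path to persist data.
conn = sqlite3.connect(":memory:")

conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# Hypothetical rows standing in for extracted and transformed source data.
rows = [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)]
conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)", rows)
conn.commit()

# Run an aggregate query and fetch the result back into Python.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)

conn.close()
```

Using parameter placeholders (the question marks) rather than string formatting lets the driver handle quoting of the inserted values.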



How open source is driving data fluency

Just as the open source revolution catalyzed the software revolution, it is also paving the way toward data democratization and organization-wide data fluency. This is especially accelerated by the open, collaborative nature of open source data science (Anaconda) and the speed of innovation it allows (TechRepublic).

Data-fluent teams around the world are using open source data science tools and technologies to democratize data by providing better access to data, streamlining data processes, creating time-saving tools, and upskilling their people. This ultimately results in equipping stakeholders across an organization with the tools to make data-driven decisions.

For example, Airbnb and Spotify open sourced their proprietary tools Airflow and Luigi, enabling organizations to easily and scalably build data pipelines and provide better, more resilient access to data. Lyft’s Amundsen allows organizations to discover, update, and understand the changes that occur to their data, building trust for data-driven teams. Netflix embraced the Jupyter Notebook (Netflix), using it as a central tool within many of its processes through the use of notebook templates. This allows everyone on a data-driven team, from business analysts to data engineers, to easily work with data.

Open-source software also allows data scientists to create highly flexible tools and frameworks that are tailor-made for their organizations’ workflows. This enables organizations to simplify complex data processes, allowing anyone with basic coding skills to work with data. For example, DataCamp’s data science team has open sourced R and Python packages, dbconnectR and dbconnect-python, that simplify connecting to databases, enabling data consumers to access data with limited R or Python skills. Airbnb developed an R package named rbnb, which allows teams to easily access and move data within Airbnb’s data infrastructure, easily create branded visualizations, access different RMarkdown report templates, and access custom functions to optimize specific Airbnb data workflows.



Upskilling is a key component of open source data science

While the benefits of streamlining data processes and developing time-saving tools cannot be overstated, these tools require the necessary skills across teams. This is why upskilling is a key component of open source driven data democratization. For example, Airbnb launched a Data University aimed at providing thousands of its employees the necessary skills to work with open source software for data science. Bloomberg uses DataCamp as part of a blended learning environment to teach data analysis with Python and empower employees of all skill levels to write data-driven financial news stories. DataCamp partnered with a major global retail bank to transition their risk analytics department from SAS to Python, reducing dependence on licensed legacy software and focusing on future-proof open source Python packages like pandas and scikit-learn. As organizations look to scale their data science with better open source tooling, closing the skills gap will need to go hand in hand with these efforts.

“We’ve trialed a number of other online learning solutions, but only DataCamp provides the interactive experience that reinforces learning. Just as you wouldn’t trust a surgeon who had watched some videos about surgery, you couldn’t trust a developer who has watched some videos about programming. There’s a great depth of content on the site. It’s great for absolute beginners, but there is very advanced content for users with more experience.”

Sarah Schlobohm
Senior Analytics Manager, Global Risk Analytics, HSBC



DataCamp’s proven learning methodology for learning open source data science

Beyond courses and tracks, DataCamp’s proven learning methodology
provides a cyclical process for learning and retention. This learning
methodology enables learners across the data fluency spectrum to assess
their skills and identify gaps, develop a learning plan based on these gaps,
practice skills, and apply them in a real-world setting. Experienced data
scientists can upskill on new open source packages and tools in their target
domain, and domain experts can learn the fundamentals of data literacy and
data science to get started with open source technologies.

Assess
Effective learning starts with understanding skill gaps and strengths. With
DataCamp Signal™, learners can understand specific skill gaps they have
across various topics and tools. From data literacy assessments like
understanding and interpreting data to programming and machine learning
assessments in R or Python, our 10-minute adaptive evaluations give
learners a personalized view of their skill gaps and learning paths to
address them.

Learn
DataCamp’s growing course library houses more than 350 expert-led, hands-on
courses across various technologies and domains for all data skills and
levels. Learners can hit the ground running with our learn-by-doing approach:
our bite-sized videos and interactive coding exercises allow them to start
working with their preferred tool and topic right in the browser.

Practice
The next step in DataCamp’s proven learning methodology is to practice all
the information retained in courses. Using practice mode, learners can
practice what they’ve learned with short challenges to test critical concepts.
With over 3,400 practice questions, learners can practice their skills across
various technologies and topics. Our mobile app is the perfect way to
practice and learn on the go.

Apply
Once skills have been assessed, cultivated through courses, and sharpened
through practice, learners are ready to apply their skills in a project-based
environment. With DataCamp projects, learners can solve a variety of
real-world R and Python data science projects. Learners can opt for guided
projects, where they can follow step-by-step tasks and receive helpful
feedback as they apply their newfound skills. They can also opt for unguided
projects, which are open-ended and offer a variety of possible solutions along
with a live-code-along video to follow how an expert data scientist would
approach a solution.

DataCamp’s entire learning experience is easy to implement and manage for
teams of any size, with an administrator dashboard that allows custom
learning paths based on roles and departments, advanced analytics and
insights to measure the impact of online learning, and seamless SSO and LMS
integrations. Teams benefit from our Customer Success Managers, who
partner with organizations to accelerate learning adoption and provide
valuable recommendations to help achieve organization-wide data fluency.
We have more than 7 million learners around the world, and we’re just
getting started. Close the talent gap. Visit datacamp.com.
