Physics Program
Instructor: Carlos Andrés Vidal Betancourt
Computational Physics 1
S2 - Programming with Data
Overview
Chapter 2 – Programming with Data
2.1 Introduction
2.2 The computing environment
2.3 Best practices
2.4 Data-centric coding
2.5 Getting help
2.6 Conclusion
Sean Raleigh
Westminster College
… a quote …
2.1 Introduction
<<The most important tool in the data science tool belt is the computer. No amount of statistical or mathematical
knowledge will help you analyze data if you cannot store, load, and process data using technology.>>
<<The aim of this chapter is to introduce you to some aspects of computing and computer programming that are
important for data science applications.>>
<<A project that can be reproduced is one that bundles together the raw data along with all the code used to
take that data through the entire pipeline of loading, processing, cleaning, transforming, exploring,
summarizing, visualizing, and analyzing it.>>
https://github.com/VectorPosse/Programming_with_Data
2.2 The computing environment
<<The choice of hardware for doing data science depends heavily on the task at hand.>>
Example workstation hardware:
Motherboard: ASUS ROG Strix X399
Processor: AMD Ryzen Threadripper, 16 cores / 32 threads, f = 4.5 GHz, 132 MB cache
RAM: DDR5, 64 GB @ 3.2 GHz
Video card: NVIDIA Titan 2070X, 8 GB GDDR6, 21 GHz, 2304 cores
SSD: 1 TB, read/write 3.5 GB/s
Cooling: Master cooler (graphene)
UPS: 10 kVA
Power supply: EVGA, 800 W
2.2 The computing environment
<<One common definition of big data is any data that is too big to fit in the memory your computer has.>>
Example: running a series of sequential VASP simulations on Miztli (UNAM's supercomputer).
2.2 The computing environment
<<A lot of serious computing is still done at the command line.>>
Example task: how do you crop the pages of a PDF to the greatest enclosing box?
https://www.baeldung.com/linux/pdf-files-crop-cli#:~:text=To%20crop%20PDF%20pages%2C%20we,by%20the%20poppler%2Dutils%20package.
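One way to do this from the shell (a sketch; the pdfcrop utility ships with TeX Live rather than poppler-utils, which the linked article covers, and the file names are placeholders):

```shell
# Crop every page of slides.pdf to its tightest bounding box.
# pdfcrop comes with TeX Live; file names here are placeholders.
pdfcrop slides.pdf slides-cropped.pdf
```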
2.2 The computing environment
1. Easy to learn
2. Free and open source
3. Third-party modules
4. Strong community
5. Compatibility
6. Libraries
7. Speed
<<Python is a general-purpose programming language that was designed to emphasize code readability and simplicity. While not originally built for data science applications per se, various libraries augment Python's native capabilities: for example, pandas for storing and manipulating tabular data, NumPy for efficient arrays, SciPy for scientific programming, and scikit-learn for machine learning.>>
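A minimal sketch of two of those libraries working together (the values and column names below are invented):

```python
import numpy as np
import pandas as pd

# NumPy: efficient numerical arrays
energies = np.linspace(0.0, 10.0, 5)     # 5 evenly spaced energy values
counts = np.array([12, 40, 95, 41, 11])  # made-up detector counts

# pandas: tabular data built on top of NumPy arrays
df = pd.DataFrame({"energy_eV": energies, "counts": counts})
print(df.describe())                     # quick summary statistics
```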
2.2 The computing environment
Typical IDE features:
1. Syntax highlighting
2. Linters to clean up code
3. Debugging tools
4. Project management
5. Code completion
6. Version control
<< Notebooks are especially valuable in educational settings. Rather than having two documents (one containing code and the other explaining the code, usually requiring awkward references to line numbers in a different file), notebooks allow students to see code embedded in narrative explanations that appear right before and after the code. >>
2.3 Best practices
“Coding like poetry should be short and concise.” – Santosh Kalwar
“Code is like humor. When you must explain it, it’s bad.” – Cory House
“Make it work, make it right, make it fast.” – Kent Beck
1. Write readable code
2. Don’t repeat yourself
<< Abstracting tasks into functions ultimately makes your code more
readable; rather than seeing the guts of a function repeated throughout
a computation, we see them defined and named once, and then that
descriptive name repeated throughout the computation, which makes
the meaning of the code much more obvious.>>
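For example (a made-up calculation, purely to illustrate the point):

```python
def kinetic_energy(mass_kg, speed_m_s):
    """Classical kinetic energy, defined and named once, reused everywhere."""
    return 0.5 * mass_kg * speed_m_s ** 2

# The descriptive name now carries the meaning at every call site,
# instead of 0.5 * m * v**2 being repeated throughout the code.
print(kinetic_energy(2.0, 3.0))   # 9.0 J
print(kinetic_energy(0.5, 10.0))  # 25.0 J
```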
3. Set seeds for random processes
<< Two computers running the same pseudorandom-number-generating algorithm starting with the same seed will produce identical sequences of pseudorandom values.>>
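A minimal sketch with NumPy's random generator (the seed value 42 is arbitrary):

```python
import numpy as np

# Two generators seeded identically produce identical sequences
rng1 = np.random.default_rng(seed=42)
rng2 = np.random.default_rng(seed=42)

print(rng1.normal(size=3))  # three pseudorandom values
print(rng2.normal(size=3))  # exactly the same three values
```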
2.3 Best practices
4. Profile, benchmark, and optimize judiciously
5. Test your code
6. Don't rely on black boxes
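A sketch of both testing and benchmarking with only the standard library (the function is invented for illustration):

```python
import timeit

def cumulative_sum(values):
    """Running total of a sequence of numbers."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

# Test: check a case whose answer is known by hand
assert cumulative_sum([1, 2, 3]) == [1, 3, 6]

# Benchmark: time the function before deciding whether to optimize it
t = timeit.timeit(lambda: cumulative_sum(range(10_000)), number=100)
print(f"100 runs: {t:.3f} s")
```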
2.3 Best practices
<< The key problem is that there is no single algorithm that performs best under all circumstances. Every data problem
presents unique challenges. A good data scientist will be familiar not only with a variety of algorithms, but also with the
circumstances under which those algorithms are appropriate. They need to understand any assumptions or conditions
that must apply. They must know how to interpret the output and ascertain to what degree the results are reliable and
valid. Many of the chapters of this book are specifically designed to help the reader avoid some of the pitfalls of using the
wrong algorithms at the wrong times. >>
<< It may not be necessary in all cases to scrutinize every line of code that implements an algorithm. But it is worthwhile
to find a paper that explains the main ideas and theory behind it. By virtue of their training, physicists are in a great
position to read technical literature and make sense of it. >>
<< Another suggestion for using algorithms appropriately is to use some fake data—perhaps data that is simulated to
have certain properties—to test algorithms that are new to you. That way you can check that the algorithms generate the
results you expect in a more controlled environment. >>
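A sketch of that idea: simulate data with known parameters and check that a fitting routine recovers them (the slope and intercept are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data with known properties: y = 2x + 1 plus small noise
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Does a least-squares fit recover the parameters we built in?
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # should be close to 2.0 and 1.0
```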
[Diagram: the three branches of physics: theoretical, experimental, and computational.]
2.4 Data-centric coding
<< XML stands for eXtensible Markup Language and uses tags, like HTML
does. It’s a fun exercise to rename a Microsoft Excel file to have a .zip
extension, unzip it, and explore the underlying XML files. >>
read_excel: a function in the pandas Python library for importing Excel files.
<<Even easier, if you can open the spreadsheet in Excel, you can export it in a
plain text format that’s easier to parse.>>
1. Obtaining data
<< The same file shown in three different plain text formats: CSV (left),
TSV (center), and fixed-width (right) with fields of size 13, 6, and 10.>>
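Reading each of those formats with pandas might look like this (file names are placeholders; the fixed-width field sizes match the caption):

```python
import pandas as pd

df_csv = pd.read_csv("data.csv")                      # comma-separated
df_tsv = pd.read_csv("data.tsv", sep="\t")            # tab-separated
df_fwf = pd.read_fwf("data.txt", widths=[13, 6, 10])  # fixed-width fields
```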
<< The go-to web scraping tool in Python is Beautiful Soup.>>
Database: SQL, short for Structured Query Language.
NoSQL databases use a variety of systems to store data, including key-value pairs, document stores, and graphs.
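Querying a SQL database from Python could look like this sketch (the database file, table, and column names are invented; sqlite3 is in the standard library):

```python
import sqlite3
import pandas as pd

# Hypothetical SQLite database with a "runs" table
conn = sqlite3.connect("measurements.db")
df = pd.read_sql("SELECT run_id, energy FROM runs WHERE energy > 1.5", conn)
conn.close()
print(df.head())
```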
2.4 Data-centric coding
2. Data structures
<< Processing matrices is easy due to advanced linear algebra libraries that make matrix operations very efficient. Python
has the numpy library that defines array-like structures like matrices. >>
A list in Python with three elements: a list of ten numbers, a list of
two strings, and a dictionary with five key-value pairs.
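Written out in code, that list might look like this (the actual values are invented to match the description):

```python
nested = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],              # a list of ten numbers
    ["hydrogen", "helium"],                        # a list of two strings
    {"H": 1, "He": 2, "Li": 3, "Be": 4, "B": 5},  # a dict with five key-value pairs
]
print(len(nested))  # 3 elements, each of a different type
```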
<< The pandas library was built to handle tabular data. Each column (of one specific type) is called a Series, and a collection of Series is called a DataFrame. >>
2.4 Data-centric coding
<< When we obtain data, it’s almost never in a form that is suitable for doing immediate analysis.>>
1. Each set of related observations forms a table.
2. Each row in a table represents an observation.
3. Each column in a table represents a variable.
<< Often, the most time-consuming task in the data pipeline is tidying the data, also called cleaning, wrangling, munging,
or transforming. Every dataset comes with its own unique data-cleaning challenges, but there are a few common
problems one can look for.>>
3. Cleaning data
[Diagram: common cleaning topics: tidy data, missing data, data values, outliers, other issues.]
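A sketch of typical cleaning steps in pandas (the DataFrame and its problems are invented for illustration):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "temp_C": [21.3, np.nan, 22.1, 250.0],  # a missing value and an outlier
    "label":  [" a ", "b", "b ", "c"],      # stray whitespace in strings
})

clean = raw.copy()
clean["label"] = clean["label"].str.strip()      # fix messy values
clean = clean.dropna(subset=["temp_C"])          # drop missing data
clean = clean[clean["temp_C"].between(-50, 60)]  # screen obvious outliers
print(clean)
```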
2.4 Data-centric coding
<< The matplotlib library is somewhat analogous to base R graphics: hard to use, but with much more flexibility and fine control. Other popular and easier-to-use options are seaborn and Bokeh. You can also use the ggplot package, implemented to emulate R's ggplot2. >>
4. Exploratory Data Analysis (EDA)
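A minimal matplotlib sketch for a first exploratory plot (the data is randomly generated):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)

fig, ax = plt.subplots()
ax.hist(sample, bins=30)  # first look at the distribution
ax.set_xlabel("value")
ax.set_ylabel("count")
plt.show()
```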
2.5 Getting help
2.6 Conclusion
Find a project and start working on it…
<< Find some data and try to clean it. Raw data is plentiful on the Internet. You might try Kaggle or the U.S. government
site data.gov. (The latter has lots of data in weird formats to give you some practice importing from a wide variety of
file types.) You can find data on any topic by typing that topic in any search engine and appending the word “data.” >>
<< Try some web scraping. Find a web page that interests you, preferably one with some cool data presented in a
tabular format. (Be sure to check that it’s okay to scrape that site. “Open” projects like Wikipedia are safe places to
start.) Find a tutorial for a popular web scraping tool and mimic the code you see there, adapting it to the website
you’ve chosen. Along the way, you’ll likely have to learn a little about HTML and CSS. Store the scraped data in a data
format like a data frame that is idiomatic in your language of choice.>>
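A hedged starting point with Beautiful Soup and pandas (the URL is just one "open" example; table structure varies by page, and pd.read_html needs lxml or html5lib installed):

```python
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

# An "open" page with tabular data; check a site's terms before scraping it
url = "https://en.wikipedia.org/wiki/List_of_physical_constants"
html = requests.get(url, timeout=30).text

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # grab the first table on the page

# pandas can parse an HTML table straight into a DataFrame
df = pd.read_html(StringIO(str(table)))[0]
print(df.head())
```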