PYTHON PROGRAMMING
What Kinds of Data?
The primary focus is on structured data, a deliberately vague term that
encompasses many different common forms of data, such as:
• Tabular or spreadsheet-like data in which each column may be a
different type (string, numeric, date, or otherwise). This includes most
kinds of data commonly stored in relational databases or tab- or
comma-delimited text files.
• Multidimensional arrays (matrices).
• Multiple tables of data interrelated by key columns (what would be
primary or foreign keys for a SQL user).
• Evenly or unevenly spaced time series.
This is by no means a complete list. Even though it may not always be obvious, a large
percentage of datasets can be transformed into a structured form that is more suitable
for analysis and modeling. If not, it may be possible to extract features from a dataset
into a structured form. As an example, a collection of news articles could be processed
into a word frequency table, which could then be used to perform sentiment analysis
(see the sketch below). Most users of spreadsheet programs like Microsoft Excel,
perhaps the most widely used data analysis tool in the world, will not be strangers to
these kinds of data.
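To make the news-article example concrete, here is a minimal sketch (with made-up
headlines, not from the original slides) that turns unstructured text into a structured
word frequency table using only the standard library:

import re
from collections import Counter

# Hypothetical headlines standing in for a collection of news articles
headlines = [
    "Markets rally as tech stocks surge",
    "Tech layoffs continue despite market rally",
]

# Tokenize into lowercase words and count occurrences
words = re.findall(r"[a-z]+", " ".join(headlines).lower())
freq = Counter(words)
print(freq.most_common(3))  # e.g., [('rally', 2), ('tech', 2), ...]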
Why Python for Data Analysis?
• Python has become one of the most popular interpreted programming
languages, along with Perl, Ruby, and others.
• Among interpreted languages, for various historical and cultural reasons,
Python has developed a large and active scientific computing and data
analysis community. In the last 10 years, Python has gone from a
bleeding-edge or “at your own risk” scientific computing language to one
of the most important languages for data science, machine learning, and
general software development in academia and industry.
• For data analysis and interactive computing and data visualization, Python
will inevitably draw comparisons with other open source and commercial
programming languages and tools in wide use, such as R, MATLAB, SAS,
Stata, and others.
• Python’s improved support for libraries (such as pandas and scikit-learn)
has made it a popular choice for data analysis tasks.
Python as Glue
• Part of Python’s success in scientific computing is the ease of integrating
C, C++, and FORTRAN code.
• Most modern computing environments share a similar set of legacy
FORTRAN and C libraries for doing linear algebra, optimization, integration,
fast Fourier transforms, and other such algorithms.
• The same story has held true for many companies and national labs that
have used Python to glue together decades’ worth of legacy software.
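As a small taste of this glue role, here is a minimal sketch (an illustration, not from
the original slides) that uses the standard library's ctypes module to call a function
from the system's C math library on a Unix-like system:

import ctypes
import ctypes.util

# Locate and load the C math library (libm); the path varies by platform
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # 1.4142135623730951

NumPy and SciPy wrap far larger Fortran and C codebases (such as BLAS and
LAPACK) in exactly this spirit.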
Solving the “Two-Language” Problem
• In many organizations, it is common to research, prototype, and test new
ideas using a more specialized computing language like SAS or R and then
later port those ideas to be part of a larger production system written in,
say, Java, C#, or C++.
• What people are increasingly finding is that Python is a suitable language
not only for doing research and prototyping but also for building
production systems. Why maintain two development environments when
one will suffice?
• I believe that more and more companies will go down this path, as there
are often significant organizational benefits to having both researchers
and software engineers using the same set of programming tools.
Why Not Python?
While Python is an excellent environment for building many kinds of
analytical applications and general-purpose systems, there are a number of
uses for which Python may be less suitable.
• As Python is an interpreted programming language, in general most
Python code will run substantially slower than code written in a compiled
language like Java or C++. As programmer time is often more valuable
than CPU time, many are happy to make this trade-off. However, in an
application with very low latency or demanding resource utilization
requirements (e.g., a high-frequency trading system), the time spent
programming in a lower-level (but also lower-productivity) language like
C++ to achieve the maximum possible performance might be time well
spent.
• Python can be a challenging language for building highly concurrent,
multithreaded applications, particularly applications with many CPU-bound
threads. The reason for this is that it has what is known as the global
interpreter lock (GIL), a mechanism that prevents the interpreter from
executing more than one Python instruction at a time; a sketch contrasting
threads and processes for CPU-bound work follows below.
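To make the GIL concrete, here is a minimal sketch (an illustration, not from the
original slides) that runs the same CPU-bound task on a thread pool and a process
pool; the process pool sidesteps the GIL because each worker runs in its own
interpreter:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_down(n):
    # Pure-Python CPU-bound loop; it holds the GIL while it runs
    while n > 0:
        n -= 1

if __name__ == "__main__":
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=4) as pool:
            # Four workers each count down five million times
            list(pool.map(count_down, [5_000_000] * 4))
        print(f"{pool_cls.__name__}: {time.perf_counter() - start:.2f}s")

On a multicore machine, the process pool typically finishes several times faster
than the thread pool for this workload.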
Essential Python Libraries
NumPy
NumPy, short for Numerical Python, has long been a cornerstone of numerical
computing in Python.
It provides the data structures, algorithms, and library glue needed for most scientific
applications involving numerical data in Python.

NumPy contains, among other things:
• A fast and efficient multidimensional array object, ndarray
• Functions for performing element-wise computations with arrays or
mathematical operations between arrays
• Tools for reading and writing array-based datasets to disk
• Linear algebra operations, Fourier transform, and random number generation
• A mature C API to enable Python extensions and native C or C++ code to
access NumPy’s data structures and computational facilities
Beyond the fast array-processing capabilities that NumPy adds to Python, one of
its primary uses in data analysis is as a container for data to be passed between
algorithms and libraries. For numerical data, NumPy arrays are more efficient for
storing and manipulating data than the other built-in Python data structures.
Also, libraries written in a lower-level language, such as C or Fortran, can operate
on the data stored in a NumPy array without copying data into some other
memory representation.
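A minimal sketch of the ndarray and element-wise operations (illustrative values,
not from the original slides):

import numpy as np

# A small 2 x 3 array of floats
data = np.array([[1.5, -0.1, 3.0],
                 [0.0, -3.0, 6.5]])

print(data.shape)   # (2, 3)
print(data.dtype)   # float64
print(data * 10)    # element-wise multiplication
print(data + data)  # element-wise addition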
Pandas
• The primary objects in pandas that will be used are the DataFrame, a tabular,
column-oriented data structure with both row and column labels, and the
Series, a one-dimensional labeled array object.
• pandas blends the high-performance, array-computing ideas of NumPy with
the flexible data manipulation capabilities of spreadsheets and relational
databases (such as SQL).
• It provides sophisticated indexing functionality to make it easy to reshape,
slice and dice, perform aggregations, and select subsets of data.
• Data structures with labeled axes supporting automatic or explicit data
alignment; this prevents common errors resulting from misaligned data and
working with differently indexed data coming from different sources.
• Integrated time series functionality.
• The same data structures handle both time series data and non–time series
data.
• Arithmetic operations and reductions that preserve metadata.
• Flexible handling of missing data.
• Merge and other relational operations found in popular databases
(SQL-based, for example).
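A minimal sketch of the Series and DataFrame objects (the data here is
illustrative, not from the original slides):

import pandas as pd

# Series: a one-dimensional labeled array
s = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])
print(s["b"])  # 7

# DataFrame: tabular, column-oriented data with row and column labels
df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
})
print(df[df["year"] == 2001])            # boolean row selection
print(df.groupby("state")["pop"].sum())  # aggregation by key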
Matplotlib
• matplotlib is the most popular Python library for producing plots and other
two-dimensional data visualizations.
• It is designed for creating plots suitable for publication.
• While there are other visualization libraries available to Python
programmers, matplotlib is the most widely used and as such has generally
good integration with the rest of the ecosystem.
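A minimal sketch of a simple figure (illustrative, not from the original slides):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()  # or fig.savefig("figure.png") for publication output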
IPython
• While it does not provide any computational or data analytical tools by itself,
IPython is designed from the ground up to maximize your productivity in both
interactive computing and software development.
• It encourages an execute-explore workflow instead of the typical
edit-compile-run workflow of many other programming languages.
• It also provides easy access to your operating system’s shell and filesystem.
• Since much of data analysis coding involves exploration, trial and error, and
iteration, IPython can help you get the job done faster.
SciPy
SciPy is a collection of packages addressing a number of different standard
problem domains in scientific computing. Its submodules include:
• scipy.integrate: Numerical integration routines and differential equation solvers
• scipy.linalg: Linear algebra routines and matrix decompositions extending
beyond those provided in numpy.linalg
• scipy.optimize: Function optimizers (minimizers) and root-finding algorithms
• scipy.signal: Signal processing tools
• scipy.sparse: Sparse matrices and sparse linear system solvers
• scipy.special: Wrapper around SPECFUN, a Fortran library implementing many
common mathematical functions, such as the gamma function
• scipy.stats: Standard continuous and discrete probability distributions (density
functions, samplers, continuous distribution functions), various statistical tests,
and more descriptive statistics
Together NumPy and SciPy form a reasonably complete and mature computational
foundation for many traditional scientific computing applications.
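A minimal sketch exercising two of these submodules (illustrative, not from the
original slides):

import numpy as np
from scipy import integrate, optimize

# scipy.integrate: integrate sin(x) from 0 to pi (exact answer is 2)
value, abs_error = integrate.quad(np.sin, 0, np.pi)
print(value)

# scipy.optimize: find the root of cos(x) - x in the bracket [0, 2]
root = optimize.brentq(lambda x: np.cos(x) - x, 0, 2)
print(root)  # approximately 0.739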
scikit-learn
Since the project’s inception in 2010, scikit-learn has become the
premier general purpose machine learning toolkit for Python
programmers. In just seven years, it has had over 1,500 contributors
from around the world. It includes submodules for such models as:
• Classification: SVM, nearest neighbors, random forest, logistic
regression, etc.
• Regression: Lasso, ridge regression, etc.
• Clustering: k-means, spectral clustering, etc.
• Dimensionality reduction: PCA, feature selection, matrix
factorization, etc.
• Model selection: Grid search, cross-validation, metrics
• Preprocessing: Feature extraction, normalization

Along with pandas, statsmodels, and IPython, scikit-learn has been critical for
enabling Python to be a productive data science programming language.
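A minimal sketch of the classification workflow (illustrative, not from the original
slides), using scikit-learn's bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a classifier and report accuracy on the held-out data
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))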
Statsmodels
statsmodels is a statistical analysis package that was seeded by work from Stanford
University statistics professor Jonathan Taylor, who implemented a number of regression
analysis models popular in the R programming language. Skipper Seabold and Josef
Perktold formally created the new statsmodels project in 2010 and since then have grown
the project to a critical mass of engaged users and contributors. Nathaniel Smith
developed the Patsy project, which provides a formula or model specification framework
for statsmodels inspired by R’s formula system.
Compared with scikit-learn, statsmodels contains algorithms for classical (primarily
frequentist) statistics and econometrics.
This includes such submodules as:
• Regression models: Linear regression, generalized linear models, robust linear models,
linear mixed effects models, etc.
• Analysis of variance (ANOVA)
• Time series analysis: AR, ARMA, ARIMA, VAR, and other models
• Nonparametric methods: Kernel density estimation, kernel regression
• Visualization of statistical model results
statsmodels is more focused on statistical inference, providing uncertainty
estimates and p-values for parameters. scikit-learn, by contrast, is more
prediction-focused.
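A minimal sketch of ordinary least squares with a Patsy-style formula (synthetic
data, not from the original slides):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y is roughly 2*x plus noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=100)

# Fit OLS using an R-style (Patsy) formula
results = smf.ols("y ~ x", data=df).fit()
print(results.params)   # intercept and slope estimates
print(results.pvalues)  # p-values, reflecting the inference focus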


Installing or Updating Python Packages
In general, Python packages can be installed with the following command:
conda install package_name

If this does not work, you may also be able to install the package
using the pip package management tool:


pip install package_name

You can update packages by using the conda update command:

conda update package_name

pip also supports upgrades using the --upgrade flag:

pip install --upgrade package_name


The Python Interpreter
Python is an interpreted language. The Python interpreter runs a program by
executing one statement at a time.
The standard interactive Python interpreter can be invoked on the command
line with the python command:
$ python
Python 3.6.0 | packaged by conda-forge | (default, Jan 13 2017, 23:17:12)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 5
>>> print(a)
5
The >>> you see is the prompt where you’ll type code
expressions.
To exit the Python interpreter and return to the command prompt,
you can either type exit() or press Ctrl-D.
Running Python programs is as simple as calling python with a .py file
as its first argument.
Suppose we had created hello_world.py with these contents:
print('Hello world')

$ python hello_world.py

Hello world

Tab Completion
On the surface, the IPython shell looks like a cosmetically different version of
the standard terminal Python interpreter (invoked with python). One of the
major improvements over the standard Python shell is tab completion, found in
many IDEs or other interactive computing analysis environments. While
entering expressions in the shell, pressing the Tab key will search the
namespace for any variables (objects, functions, etc.) matching the characters
you have typed so far:
In [1]: an_apple = 27
In [2]: an_example = 42
In [3]: an<Tab>
an_apple    and         an_example  any

In [3]: b = [1, 2, 3]
In [4]: b.<Tab>
b.append    b.count     b.insert    b.reverse
b.clear     b.extend    b.pop       b.sort
b.copy      b.index     b.remove
The same goes for modules:
In [1]: import datetime
In [2]: datetime.<Tab>
datetime.date          datetime.MAXYEAR       datetime.timedelta
datetime.datetime      datetime.MINYEAR       datetime.timezone
datetime.datetime_CAPI datetime.time          datetime.tzinfo
In the Jupyter notebook and newer versions of IPython (5.0 and higher), the
autocompletions show up in a drop-down box rather than as text output.
Introspection
Using a question mark (?) before or after a variable will display some general
information about the object:
In [8]: b = [1, 2, 3]
In [9]: b?

Type:        list
String Form: [1, 2, 3]
Length:      3
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items

In [10]: print?

Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
This is referred to as object introspection. If the object is a function or instance
method, the docstring, if defined, will also be shown. Suppose we'd written the
following function (which you can reproduce in IPython or Jupyter):
In [10]: def add_numbers(a, b):
    ...:     """
    ...:     Add two numbers together
    ...:
    ...:     Returns
    ...:     -------
    ...:     the_sum : type of arguments
    ...:     """
    ...:     return a + b

In [11]: add_numbers?
Signature: add_numbers(a, b)
Docstring:
Add two numbers together

Returns
-------
the_sum : type of arguments
Type:      function
Using ?? will also show the function’s source code if possible:

In [12]: add_numbers??

Signature: add_numbers(a, b)
Source:
def add_numbers(a, b):
    """
    Add two numbers together

    Returns
    -------
    the_sum : type of arguments
    """
    return a + b
Type:      function
A number of characters combined with the wildcard (*) will show
all names matching the wildcard expression. For example, we
could get a list of all functions in the top-level NumPy namespace
containing load:

import numpy as np

In [13]: np.*load*?

np.__loader__
np.load
np.loads
np.loadtxt
np.pkgload
The %run Command
You can run any file as a Python program inside the environment of your IPython session
using the %run command. Suppose you had the following simple script stored in
jupyter_script_test.py:

def f(x, y, z):
    return (x + y) / z

a = 5
b = 6
c = 7.5
result = f(a, b, c)

You can execute this by passing the filename to %run:

In [14]: %run jupyter_script_test.py

The script is run in an empty namespace (with no imports or other variables defined) so
that the behavior should be identical to running the program on the command line using
python script.py. All of the variables (imports, functions, and globals) defined in the file
(up until an exception, if any, is raised) will then be accessible in the IPython shell:

In [15]: c
Out[15]: 7.5
In [16]: result
Out[16]: 1.4666666666666666
In the Jupyter notebook, you may also use the related %load magic
function, which imports a script into a code cell:

>>> %load jupyter_script_test.py

def f(x, y, z):
    return (x + y) / z

a = 5
b = 6
c = 7.5
result = f(a, b, c)
About Magic Commands
Jupyter’s special commands (which are not built into Python itself) are known as “magic” commands.
These are designed to facilitate common tasks and enable you to easily control the behavior of the
Jupyter system. A magic command is any command prefixed by the percent symbol %. For example, you
can check the execution time of any Python statement, such as a matrix multiplication, using the
%timeit magic function:

import numpy as np

In [20]: a = np.random.randn(100, 100)

In [21]: %timeit np.dot(a, a)
10000 loops, best of 3: 20.9 µs per loop

Magic commands can be viewed as command-line programs to be run within the
Jupyter system. Many of them have additional “command-line” options, which can
all be viewed (as you might expect) using ?:

In [21]: %debug?

Docstring:
::
  %debug [--breakpoint FILE:LINE] [statement [statement ...]]

Activate the interactive debugger.

This magic command support two ways of activating debugger. One is to activate
debugger before executing code. This way, you can set a break point, to step
through the code from the point. You can use this mode by giving statements to
execute and optionally a breakpoint.
Magic functions can be used by default without the percent sign, as long as no variable is
defined with the same name as the magic function in question. This feature is called
automagic and can be enabled or disabled with %automagic.

Some magic functions behave like Python functions and their output can be assigned to a
variable:

In [22]: %pwd
Out[22]: '/home/wesm/code/pydata-book'

In [23]: foo = %pwd

In [24]: foo
Out[24]: '/home/wesm/code/pydata-book'
Some frequently used Jupyter magic commands include %run, %load, %timeit,
%debug, %pwd, %matplotlib, and %automagic, all of which appear in this section.
Matplotlib Integration
One reason for IPython’s popularity in analytical computing is that it integrates well with
data visualization and other user interface libraries like matplotlib.
The %matplotlib magic function configures matplotlib’s integration with the
Jupyter notebook. In Jupyter, the command is:
In [26]: %matplotlib inline
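Once %matplotlib inline has been run, a plotting command in a later cell renders
its figure directly in the notebook output; a minimal sketch (illustrative, not from
the original slides):

import numpy as np
import matplotlib.pyplot as plt

# With %matplotlib inline active, this figure appears below the cell
plt.plot(np.random.standard_normal(50).cumsum(), "k--")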
