History
1. The POSTGRES project, led by Professor Michael Stonebraker, was
sponsored by the Defense Advanced Research Projects Agency
(DARPA), the Army Research Office (ARO), the National Science
Foundation (NSF), and ESL, Inc.
2. The implementation of POSTGRES began in 1986. The initial concepts
for the system were presented in "The design of POSTGRES", and the
definition of the initial data model appeared in "The POSTGRES data
model". The design of the rule system at that time was described in "The
design of the POSTGRES rules system". The rationale and architecture of
the storage manager were detailed in "The design of the POSTGRES
storage system".
https://web.mit.edu/cygwin/cygwin_v1.3.2/usr/doc/postgresql-7.1.2/html/rules.html
https://dsf.berkeley.edu/papers/ERL-M87-06.pdf
3. In 1994, Andrew Yu and Jolly Chen added an SQL language interpreter
to POSTGRES. Under a new name, Postgres95 was subsequently
released to the web to find its own way in the world as an open-source
descendant of the original POSTGRES Berkeley code.
4. By 1996, it became clear that the name "Postgres95" would not stand the
test of time. We chose a new name, PostgreSQL, to reflect the
relationship between the original POSTGRES and the more recent
versions with SQL capability. At the same time, we set the version
numbering to start at 6.0, putting the numbers back into the sequence
originally begun by the Berkeley POSTGRES project.
PostgreSQL
PostgreSQL is cross-platform and runs on many operating systems, such
as Linux, FreeBSD, OS X, Solaris, and Windows.
PostgreSQL features transactions with ACID properties.
PostgreSQL manages concurrency through multiversion concurrency
control (MVCC).
MVCC
MVCC is a database optimization method that allows multiple transactions to
access the same data concurrently without locking, by keeping multiple
versions of data objects.
This approach improves performance and scalability by enabling reads to
occur without waiting for writes to complete, and vice versa, thereby reducing
conflicts and delays in a multi-user environment.
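A minimal way to see this in practice, assuming a hypothetical table accounts(id int, balance numeric) and two separate psql sessions:
-- session A
begin transaction isolation level repeatable read;
select balance from accounts where id = 1;   -- reads the current value
-- session B (a second psql window, autocommit)
update accounts set balance = balance + 100 where id = 1;
-- session A again
select balance from accounts where id = 1;   -- still sees the old value: the snapshot is unchanged
commit;
select balance from accounts where id = 1;   -- a new snapshot now sees session B's change
Note that neither session blocks the other: the reader does not wait for the writer, and the writer does not wait for the reader.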
PostgreSQL - tools
Installation resources - https://www.postgresql.org/download/
Documentation - https://www.postgresql.org/docs/
We will be working on Windows. Remember, however, that for production
purposes PostgreSQL is usually hosted in a Linux environment.
PostgreSQL on Windows
Performance: PostgreSQL is known to perform better on UNIX-like systems.
This is due to differences in how Windows and UNIX-like operating systems
handle file system operations and process management. PostgreSQL's
architecture and optimization are more naturally suited to the UNIX-like
environments.
File System: The file systems commonly used on UNIX-like systems (e.g.,
EXT4, XFS on Linux) are often more efficient at handling the types of
read/write operations databases perform, compared to the NTFS file system
on Windows.
Process Forking: PostgreSQL relies on process forking for its operation, a
task that is more efficiently performed in UNIX-like systems. Windows does
not natively support fork(), and though PostgreSQL works around this, it incurs
additional overhead.
Support and Community: The PostgreSQL community and the majority of its
development effort are focused on UNIX-like systems. This means better
support, more frequent updates, and a wider array of tools and extensions for
these systems.
Operational Tools and Scripts: Many of the operational tools, scripts, and
best practices are designed with UNIX-like systems in mind. This includes
backup tools, monitoring solutions, and performance tuning utilities.
Deployment Standards: UNIX-like systems are a common choice for server
environments, making them a standard choice for deploying production
databases. This standardization helps in leveraging community knowledge,
documentation, and experienced personnel.
Security: UNIX-like systems have a long history of being used in multi-user,
networked environments, leading to a mature and robust security model that's
well-suited for protecting a database system.
While it's technically possible to run PostgreSQL on Windows for development
or testing, for production environments, a UNIX-like system is generally
preferred for its performance, reliability, and ecosystem support.
General architecture overview
1. Postmaster Process: When PostgreSQL starts, it launches the
postmaster process. This process listens on the specified port for
incoming connection requests from client applications.
2. Forking New Processes: Upon receiving a connection request, the
postmaster forks a new backend process for each client connection. This
means that for each client connected to the database, there is a separate
server process handling that connection.
3. Dedicated Backend Process: Each backend process is dedicated to
serving its client: executing queries, accessing the database files, and
performing transactions. This process-based approach isolates the
activities of different clients, enhancing security and stability.
4. No Threading: Unlike some other database management systems that
use threading, PostgreSQL's use of separate processes means that it
does not rely on threading to handle concurrent connections. This design
choice is partly why PostgreSQL is traditionally favored on UNIX-like
systems, which efficiently manage processes.
5. Shared Memory and IPC: PostgreSQL backend processes
communicate with each other and manage concurrency control (locks,
shared buffers, etc.) using shared memory and inter-process
communication (IPC). This allows for efficient data sharing and
coordination among the processes.
6. Connection Pooling (External): Because each connection is handled
by a separate process, the overhead can become significant with a high
number of concurrent connections. To mitigate this, external connection
poolers (like PgBouncer) are often used to manage a pool of connections
that client applications can share, reducing the overhead on the
PostgreSQL server.
https://www.pgbouncer.org/
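The one-process-per-connection model can be observed directly from SQL through the pg_stat_activity view:
select pid, usename, datname, state, query
from pg_stat_activity
where backend_type = 'client backend';
Each returned row corresponds to one client connection, and pid is the operating-system process ID of the backend process serving it.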
Processes vs threads
Connection pooling and the use of threads versus processes are two
important architectural choices that database management systems (DBMSs)
make to handle concurrent connections and operations efficiently.
These choices significantly affect the DBMS's performance, scalability, and
how it manages resources.
PostgreSQL: PostgreSQL uses a process-based model for handling client
connections. When a client connects to PostgreSQL, the main server process
( postmaster ) forks a new backend process dedicated to handling that
connection. Each connection has its own process, and these processes are
isolated from each other, which improves stability and security but can
increase overhead, especially with a large number of connections.
PostgreSQL does not include built-in connection pooling.
MySQL: MySQL traditionally uses a thread-based model for handling client
connections: each client connection is handled by a separate thread within
the server process.
This model can be more memory-efficient than a process-based model
because threads share the same address space and resources, reducing the
per-connection overhead.
MySQL provides built-in connection pooling in its client libraries, like
Connector/J for Java.
Oracle: Oracle can use both processes and threads, depending on the
configuration and the operating system. On UNIX and Linux, it uses a
process-based model similar to PostgreSQL but also supports threading for
certain operations and components, combining the advantages of both
approaches.
Oracle includes built-in connection pooling through features like Oracle
Connection Pooling and Database Resident Connection Pooling (DRCP).
Microsoft SQL Server: SQL Server uses a thread-based model to handle
client connections. It employs a thread pool to manage the execution of these
connections, optimizing resource use and allowing for a high degree of
concurrency with lower overhead compared to a process-based model.
SQL Server supports connection pooling through its client libraries, such as
ADO.NET for .NET applications, which manage pools of connections to
reduce the overhead of opening and closing connections frequently.
PostgreSQL - choice
PostgreSQL opts for a process-based approach for handling connections and
relies on external tools for connection pooling. This choice emphasizes
stability, security, and isolation at the expense of higher resource usage with
many connections.
In contrast, databases like MySQL and SQL Server use thread-based models
and provide built-in support for connection pooling, aiming for efficiency and
scalability, particularly in environments with a high number of concurrent
connections.
Oracle offers a more flexible approach, capable of utilizing both processes
and threads, and provides comprehensive connection pooling mechanisms.
PostgreSQL - high level architecture
Shared memory
In PostgreSQL, shared memory is a critical component that allows different
server processes to access common data structures and buffers, enhancing
performance and efficiency.
Shared buffers
These are a part of the shared memory where PostgreSQL caches table data
and indexes. By keeping frequently accessed data in memory, PostgreSQL
can significantly reduce disk I/O, speeding up data retrieval and query
execution. The size of the shared buffers is configurable, and tuning it
correctly is vital for database performance.
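What is currently cached can be inspected with the pg_buffercache contrib extension; a simplified sketch (it ignores other databases and relation forks):
create extension if not exists pg_buffercache;
-- top 10 relations by number of cached 8 kB buffers
select c.relname, count(*) as buffers
from pg_buffercache b
join pg_class c on b.relfilenode = c.relfilenode
group by c.relname
order by buffers desc
limit 10;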
WAL buffers
WAL stands for Write-Ahead Logging. WAL buffers are used to temporarily
hold changes to the database before they are written to the disk in the WAL
files. This mechanism ensures data integrity and allows the database to
recover from crashes by replaying these logs. The WAL buffers are an
essential part of PostgreSQL's durability and data integrity guarantees.
CLOG buffers
CLOG stands for Commit Log. These buffers are used to track the commit
status of transactions. PostgreSQL uses this information to determine whether
a transaction has been committed or rolled back, which is crucial for
maintaining transactional integrity and supporting features like point-in-time
recovery.
Temp buffers
Temp buffers are used for temporary tables, i.e., tables created explicitly
with the TEMPORARY keyword. Temp buffers are allocated per session,
separately from the main shared buffers, to hold data for these temporary
tables. This prevents the main cache from being polluted by temporary data
that is not needed after the session (or transaction) ends. (Sorts and hash
operations that do not fit into work_mem spill to temporary files on disk,
which is a separate mechanism.)
File types
In PostgreSQL, several key file types play vital roles in database storage,
transaction logging, error logging, and data recovery. Here's a brief overview
of each:
Data files
These files store the actual data of the database, including tables, indexes,
and other database objects. Data files are organized into a directory structure
under the PostgreSQL data directory ( PGDATA ). Each table and index is
stored in one or more files on the disk. The organization and naming of these
files are managed by PostgreSQL and are transparent to users.
Page files
1. A page is the smallest unit of data storage.
2. Every table and index is stored as an array of pages of a fixed size.
3. By default, the page size in PostgreSQL is 8 kB.
4. A different page size can be configured when compiling the server.
5. All pages are logically equivalent, and any row can be stored on any
page.
Page Header (Header Info):
pd_lsn: Log sequence number identifying the WAL record for the latest
change to the page.
pd_checksum: Optional checksum value for this page.
pd_flags: Flags providing additional information about the page, such
as whether it has been pruned.
pd_lower: The offset to the end of the line pointer (item pointer) array,
i.e., the start of the free space.
pd_upper: The offset to the end of the free space, i.e., where the stored
tuples begin.
pd_special: The offset to the special data at the end of the page, such
as index-specific data.
pd_pagesize_version: Combined field holding the page size and the
page layout version.
pd_prune_xid: Oldest unpruned transaction ID on the page (zero if none),
used as a hint for pruning.
Line pointers:
These are item pointers that act as an array index; each one points to a tuple
(row) within the page. They show the actual position of the tuple on the page
and allow tuples to be moved around within a page to keep them contiguous
without changing the line pointer array.
Heap Tuples (Record Data):
These are the actual data records (tuples) stored on the page. A tuple consists
of one or more fields, corresponding to table columns.
Free Space (Hole):
The area between pd_lower and pd_upper is the free space, which can be
used to store new tuples or to accommodate the expansion of existing
tuples.
This structure allows PostgreSQL to efficiently manage the tuples within a
page, providing rapid access to records, as well as space and transaction
management. The page header contains meta-information about the page,
which the database system uses to manage the page's contents and maintain
consistency and integrity.
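This layout can be examined directly with the pageinspect contrib extension. A sketch, assuming a small table named student exists:
create extension if not exists pageinspect;
-- page header of the first page (block 0)
select * from page_header(get_raw_page('student', 0));
-- line pointers and tuple headers on the same page
select lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
from heap_page_items(get_raw_page('student', 0));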
WAL Files (Write-Ahead Logging Files)
WAL files are crucial for data integrity and recovery. PostgreSQL uses write-
ahead logging to record every change made to the database's data in these
files before the changes are written to the actual data files. This approach
ensures that, in the event of a crash, the database can be recovered to a
consistent state by replaying these logs. WAL files are stored in the pg_wal
directory (formerly known as pg_xlog in versions before PostgreSQL 10).
Using WAL results in a significantly reduced number of disk writes, because
only the WAL file needs to be flushed to disk to guarantee that a transaction is
committed, rather than every data file changed by the transaction. The WAL
file is written sequentially, and so the cost of syncing the WAL is much
less than the cost of flushing the data pages. This is especially true for
servers handling many small transactions touching different parts of the data
store. Furthermore, when the server is processing many small concurrent
transactions, one fsync of the WAL file may suffice to commit many
transactions.
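The current WAL position and the corresponding segment file can be checked from SQL:
select pg_current_wal_lsn();                     -- current WAL write location (LSN)
select pg_walfile_name(pg_current_wal_lsn());    -- WAL segment file containing that location
show wal_segment_size;                           -- segment size, 16MB by default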
Log files
PostgreSQL generates log files to record various operational events, including
errors, warnings, and informational messages about database operations. The
configuration of logging is highly customizable, allowing administrators to
specify what types of events are logged and where the logs are stored. Log
files are essential for diagnosing problems, monitoring database activity, and
auditing.
Archive files
Archive files are essentially copies of WAL files that are stored outside the
main PostgreSQL data directory for backup and recovery purposes. Archiving
WAL files is part of a disaster recovery strategy, enabling point-in-time
recovery (PITR) or replication. The process of archiving is controlled by the
archive_command setting in PostgreSQL's configuration, which specifies how
to copy each WAL segment to a designated archive location.
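A minimal sketch of enabling archiving (the archive path is only an example; any reliable copy command will do):
alter system set archive_mode = 'on';                              -- requires a server restart
alter system set archive_command = 'cp %p /path/to/archive/%f';    -- %p = path of the WAL segment, %f = its file name
After a restart, each completed WAL segment is copied by the archiver process using this command.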
Utility processes
In PostgreSQL, several background utility processes work together to ensure
the database's smooth operation, data integrity, and performance optimization.
Writer (Background writer)
The Background Writer process is responsible for writing modified pages from
shared buffers (in-memory cache of database blocks) to the disk. This process
helps in reducing disk I/O and managing memory efficiently by writing out
"dirty" pages in the background, allowing the database to handle more
transactions.
WAL Writer
The Write-Ahead Logging (WAL) Writer process is tasked with writing WAL
records from the in-memory WAL buffers to the disk. This is crucial for
ensuring that all changes to the database are recorded promptly in the WAL
files, which are essential for database recovery and durability.
Checkpointer
The Checkpointer process periodically writes a consistent state of the
database to disk, creating checkpoints. Checkpoints are important for
recovery processes because they mark a point in time from which the
database can be restored using the WAL files, reducing recovery time.
Archiver process
The Archiver process copies completed WAL files to a designated archive
location for safekeeping. This archiving is part of PostgreSQL's continuous
archiving feature, enabling point-in-time recovery (PITR) and supporting
replication setups.
Logging collector
This process is responsible for capturing log messages and redirecting them
to the appropriate log files or other destinations as configured. The Logging
Collector consolidates logging from various server processes, helping in
centralized log management and analysis.
Stats collector
The Statistics Collector gathers information about database activity, such as
the number of operations performed on tables and indexes, and system usage
statistics. This data is essential for monitoring, performance tuning, and query
optimization, as it provides insights into how the database is being used.
Autovacuum launcher
The Autovacuum Launcher process oversees the autovacuum workers, which
are responsible for automatic vacuuming (cleanup of dead tuples) and
analyzing (updating statistics for the optimizer) of the database tables. This
process helps in reclaiming storage space, preventing table bloat, and
keeping the statistics used by the query planner up-to-date, which is crucial for
maintaining good database performance.
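Most of these background processes can be seen from SQL, since each one reports its backend_type in pg_stat_activity:
select pid, backend_type
from pg_stat_activity
order by backend_type;
Typical rows include checkpointer, background writer, walwriter and autovacuum launcher, in addition to one client backend per connection.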
Processing the SQL command - high level
Parsing
Check Query Syntax: The parser verifies that the SQL query is written
correctly according to the SQL grammar rules. This includes checking for
proper use of keywords, commas, parentheses, and other syntax
elements.
Check Query Type: The parser determines the type of the SQL query,
such as SELECT, INSERT, UPDATE, DELETE, etc. This helps the
database understand what action is being requested.
Split the Query into Separate Elements: The parser breaks down the
SQL query into its constituent elements, such as the SELECT clause,
WHERE clause, JOIN conditions, etc. This dissection helps in creating a
parse tree or an abstract syntax tree that represents the query's logical
structure.
Optimization
Generate Query Execution Plan: The optimizer evaluates various ways
to execute the query and generates one or more possible execution
plans. These plans outline different strategies for how the database can
retrieve or modify the data.
Choose Statistics: The optimizer selects relevant statistical information
about the database tables and indexes involved in the query. This
information can include the size of tables, cardinality of fields, and
distribution of values, which are crucial for making informed decisions
about the most efficient way to execute the query.
Calculate Cost: For each potential execution plan, the optimizer
calculates an estimated cost, which represents the resources needed to
execute the plan, such as CPU, memory, and I/O. This estimation is
based on the query's complexity, the size of the data, and the statistics
chosen.
Choose Execution Plan: Based on the cost estimations, the optimizer
selects the most efficient execution plan for the query. This is the plan
that the database expects to consume the least amount of resources and
complete the operation in the shortest time.
Execution
Query Execution: This is the phase where the database engine carries out
the steps outlined in the chosen execution plan. It involves accessing the
physical tables and indexes, performing joins, applying filters, aggregating
data, and finally, presenting the results to the user or application. The
execution phase interacts with the storage engine to read or modify the actual
data stored on disk.
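The planner's choices can be inspected with EXPLAIN; a sketch assuming a hypothetical table student(id int):
explain select * from student where id = 42;           -- shows the estimated plan only; the query is not executed
explain analyze select * from student where id = 42;   -- runs the query and shows actual times and row counts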
Step-by-step - select operation
1. Query Reception: The client application sends a SELECT query to the
PostgreSQL server, which is picked up by one of the backend processes
dedicated to that client connection.
2. Parsing and planning: explained in the text above.
3. Shared buffers: Before actual execution, the database checks its shared
buffers, which is the in-memory cache where PostgreSQL stores pages
read from the disk. If the data needed for the query is already in the
shared buffers, it can be accessed directly, saving time by avoiding disk
I/O.
1. If the required data is not in the shared buffers, PostgreSQL loads it
from the data files on disk into the shared buffers.
2. PostgreSQL implements a Least Recently Used (LRU) strategy to
manage the shared buffers. If the buffers are full, the least recently
used data is evicted to make room for new data.
4. Execution: With the execution plan chosen and data in shared buffers,
PostgreSQL proceeds with the execution. This involves:
1. Accessing the required rows in the shared buffers.
2. Performing any necessary joins, sorting, filtering, and aggregation
as dictated by the execution plan.
3. Generating the result set.
5. WAL (Write-Ahead Logging): For a SELECT statement, the WAL is
generally not involved because there is no data modification. However,
WAL comes into play if the query triggers any data changes, such as
updates caused by a function or a trigger.
6. Background Writer and WAL Writer: Although they are not directly
involved in reading operations, these processes ensure that the shared
buffers are efficiently written back to disk and that the WAL records are
flushed to the storage. This ensures that the shared buffers have enough
space for new data and that the system can recover in case of a crash.
7. Autovacuum: While not directly involved in the execution of a specific
SELECT query, the autovacuum process is indirectly important. It
maintains table and index health by cleaning up dead tuples and
updating statistics, which helps the planner generate efficient execution
plans.
8. Result Return: Finally, the backend process returns the result set of the
SELECT query to the client application.
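The shared-buffers behaviour described above can be observed with the BUFFERS option of EXPLAIN (the table name is again only an example):
explain (analyze, buffers) select * from student where id = 42;
-- "shared hit" = pages found in shared buffers, "read" = pages loaded from disk (or the OS cache)
Running the same query twice usually turns reads into hits, because the pages remain cached in shared buffers.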
Step-by-step - update
1. Query Reception: The UPDATE query arrives from the client and is
received by the PostgreSQL server process dedicated to the client's
connection.
2. Parsing and Planning: The query is parsed to validate its syntax and
then a parse tree is created. The planner/optimizer takes this parse tree
and generates a query plan, using statistics to determine the most
efficient way to locate and update the targeted rows.
3. Shared Buffers: PostgreSQL checks if the data pages to be updated are
in the shared buffers.
1. If the data is in the shared buffers, it can be directly accessed.
2. If not, the relevant pages are loaded from the disk into the shared
buffers.
4. WAL Write: Before the data pages are modified in the shared buffers, the
changes are recorded in the WAL. This ensures that, in case of a crash,
the database can recover by replaying the WAL entries.
5. Execute Update:
1. The database engine locks the rows to be updated to maintain
isolation.
2. The actual update is performed on the in-memory data pages within
the shared buffers.
3. If row versions are moved because of the update (due to row size
change), index entries pointing to the row are updated, and the new
row version is marked as the latest.
4. Visibility information is updated to reflect that these rows have a
new version.
6. Background Writer and WAL Writer
1. The background writer may write dirty pages from shared buffers to
the disk, but it usually does so based on its regular scheduling.
2. The WAL writer flushes WAL records to the disk at intervals to
ensure that the log is up-to-date on disk.
7. CLOG Update: The commit log (CLOG) is updated to reflect that the
transaction has been committed, ensuring the changes are permanent
and visible to other transactions according to the isolation rules.
8. Autovacuum and Freezing
1. The autovacuum process eventually cleans up the old row versions
that are no longer visible to any transactions to reclaim space and
maintain performance.
2. During this, it might also "freeze" old rows to avoid transaction ID
wraparound issues by marking them as permanently visible to all
transactions.
9. Checkpointer
1. The checkpointer process periodically forces both dirty shared
buffers and WAL records to the disk, creating consistent
checkpoints.
2. This process ensures data is not lost and aids in faster recovery if
the system crashes.
10. Archiver
1. If archiving is enabled, completed WAL segments are copied to the
archive location after they are full or a switch has been requested.
This is essential for point-in-time recovery and replication.
11. Result Return: After the transaction is committed, PostgreSQL returns a
command tag to the client application, indicating the number of rows
updated.
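The effect of an UPDATE on row versions can be seen through the system columns xmin, xmax and ctid; a sketch assuming a hypothetical table t(id int, name text):
select ctid, xmin, xmax, * from t where id = 1;
update t set name = 'new value' where id = 1;
select ctid, xmin, xmax, * from t where id = 1;
-- ctid (the physical location) usually changes because the update wrote a new row version;
-- xmin of that new version is the ID of the updating transaction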
Step-by-step - INSERT
1. Query Reception: When a client sends an INSERT query to
PostgreSQL, it is received by the PostgreSQL server process (backend)
that is handling the client connection.
2. Parsing and Planning: The query is parsed to check the syntax, and the
planner/optimizer generates a plan for the query. The plan includes
where and how to insert the rows.
3. Shared Buffers: PostgreSQL checks its shared buffers to see if the pages
into which the new rows will be inserted are already in memory.
1. If the pages are in the shared buffers, PostgreSQL can directly
insert the new rows into these pages.
2. If the pages are not in memory, PostgreSQL reads them from the
disk into the shared buffers.
4. WAL Logging: Before making any changes, PostgreSQL writes the insert
action to the WAL. This ensures that the database can recover the insert
if there is a crash before the changes are written to disk.
5. Execute Insert: PostgreSQL then:
1. Inserts the new row(s) into the appropriate page in the shared
buffers.
2. If indexes are affected by the insert, it updates the indexes
accordingly.
3. Updates visibility information to indicate the presence of the new
row(s).
6. Background Writer and WAL Writer: While these processes are not
directly involved in the INSERT command execution, they are essential
for durability.
1. The Background Writer may write dirty pages from the shared
buffers to the disk.
2. The WAL Writer ensures the WAL is written to disk at regular
intervals.
7. Commit: The transaction is committed, making the insert permanent. The
commit log (CLOG) is updated to reflect this.
8. Autovacuum: This background process will later clean up dead tuples left
by superseded versions of rows, which is not usually an immediate part
of an INSERT operation but is critical for long-term maintenance.
9. Return Result: PostgreSQL returns a result to the client, typically the
number of rows inserted or a confirmation that the insertion was
successful.
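A small illustration, using the same hypothetical table t(id int, name text):
insert into t (id, name) values (1, 'first row') returning id, name;   -- RETURNING avoids a separate SELECT
select xmin, ctid, * from t where id = 1;                               -- xmin = ID of the inserting transaction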
OS cache
PostgreSQL configuration
https://pgtune.leopard.in.ua/
The "out-of-the-box" configuration of PostgreSQL is deliberately conservative
to ensure it will start up on a wide variety of hardware and software
environments.
This conservative setup is often not optimized for specific workloads or
hardware configurations, leading to the perception that additional configuration
is necessary for optimal performance.
Editing configuration
1. Edit postgresql.conf
2. Restart the server, or run SELECT pg_reload_conf(); (a reload is enough
only for parameters that do not require a restart)
3. Verify with SHOW work_mem;
shared_buffers
It is always faster to read or write data in memory than on any other media. A
database server also needs memory for quick access to data, whether it is
READ or WRITE access. In PostgreSQL, this is referred to as "shared buffers"
and is controlled by the parameter shared_buffers.
As a starting point, a common recommendation is to set shared_buffers to
about 25% to 40% of the available system RAM on a dedicated database
server.
For systems with a lot of RAM, setting shared_buffers to a value larger than
32GB might not yield better performance due to the way operating systems
manage memory. It's often more effective to rely on the operating system's
cache beyond a certain point.
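A sketch of checking and changing the value (4GB is only an example for a dedicated server with plenty of RAM):
show shared_buffers;                        -- the built-in default is typically 128MB
alter system set shared_buffers = '4GB';    -- written to postgresql.auto.conf
-- shared_buffers requires a full server restart; SELECT pg_reload_conf(); is not enough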
work_mem
work_mem is another crucial memory-related configuration parameter in
PostgreSQL, determining the amount of memory to be used by internal sort
operations and hash tables before writing to temporary disk files.
This setting plays a significant role in the performance of queries that involve
sorting, aggregations, or complex joins.
Usage per Query: Unlike shared_buffers , which is shared across all
connections, work_mem is allocated for each sort, hash, or
materialization operation within a query.
Risk of Oversizing: Setting work_mem too high can lead to excessive
memory usage because each session or query operation that needs
memory can allocate up to the specified amount. On systems with many
concurrent queries, this can quickly add up and potentially exhaust
available memory.
Tuning for Workload: There's no one-size-fits-all value for work_mem; it
depends on your workload, the query patterns, and the system's RAM. A
common approach is to start with a conservative value and increase it if
you notice queries are frequently spilling to disk (observable through the
EXPLAIN ANALYZE command or monitoring tools).
A typical starting point might be a few megabytes, such as 4MB or 8MB. You
might adjust this based on the available system memory and the nature of
your queries.
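work_mem can also be set per session, which makes it easy to experiment; a sketch assuming a table t large enough for the sort to spill:
set work_mem = '4MB';
explain analyze select * from t order by name;    -- look for "Sort Method: external merge  Disk: ..."
set work_mem = '64MB';
explain analyze select * from t order by name;    -- with enough memory: "Sort Method: quicksort  Memory: ..."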
maintenance_work_mem
maintenance_work_mem is a configuration parameter in PostgreSQL that
determines the maximum amount of memory used for maintenance
operations, such as VACUUM , CREATE INDEX , and ALTER TABLE ADD FOREIGN
KEY . It's specifically designed to improve the performance of these operations
by allowing them to use more memory, which can significantly reduce their
execution time, especially on large databases.
A common recommendation is to start with a value between 128MB to 1GB for
systems with ample memory, adjusting based on your specific needs and
observations. Another recommendation is to use 5% of the available RAM.
temp_buffers
temp_buffers is a PostgreSQL configuration parameter that determines the
amount of memory allocated for temporary tables within each database
session.
Temporary tables are those explicitly created with the TEMPORARY keyword
in SQL commands. This setting is important for workloads that heavily rely on
temporary tables, as it can significantly affect performance by reducing disk
I/O for operations involving these tables.
The default value is typically relatively low (e.g., 8MB). If your application
makes extensive use of temporary tables and you have sufficient memory,
increasing this value can improve performance. However, since this memory
is not shared and each session gets its own allocation, setting this value too
high on systems with many concurrent connections could lead to excessive
overall memory consumption.
effective_cache_size
effective_cache_size is a configuration parameter in PostgreSQL that provides
the database optimizer with information about the expected size of the
system's cache (including both PostgreSQL's own cache and the operating
system's disk cache).
This parameter helps the query planner make more informed decisions,
especially when choosing between different query plans that involve disk I/O
operations, such as sequential scans versus index scans.
Usually it is set to around 50%-75% of the available RAM.
max_wal_size
max_wal_size is a PostgreSQL configuration parameter that specifies the
maximum size of WAL (Write-Ahead Logging) files that can accumulate on
your disk before a checkpoint is forced. This setting helps manage the amount
of WAL data that must be written to disk and retained, which can affect your
database's performance, especially for write-heavy workloads, and influence
recovery time after a crash.
Purpose: Increasing max_wal_size allows PostgreSQL to delay checkpoints,
potentially improving performance by reducing disk I/O. However, a larger
WAL can also mean more data to process during recovery, so it's about
finding a balance.
Default Value: PostgreSQL has a default setting (e.g., 1GB in some versions),
but the optimal size depends on your workload and disk capacity.
Companion Setting: min_wal_size is another parameter that works alongside
max_wal_size to define a range within which the WAL size can fluctuate.
PostgreSQL will try to recycle WAL files up to min_wal_size without deleting
them, providing a buffer for bursts of write activity.
wal_buffers
Setting wal_buffers in PostgreSQL configures the amount of memory allocated
for storing WAL (Write-Ahead Logging) records before they are written to disk.
This parameter is crucial for the performance of write operations and overall
database throughput, especially in systems with high transaction rates.
Default Setting: PostgreSQL automatically sets wal_buffers to a default value
based on the setting of shared_buffers. The default is typically 1/32 of
the shared_buffers value, but not more than the size of one WAL segment
(16MB by default).
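The current WAL-related settings (like any other parameter) can be checked through the pg_settings view:
select name, setting, unit, context
from pg_settings
where name in ('max_wal_size', 'min_wal_size', 'wal_buffers');
The context column shows whether changing a parameter needs only a configuration reload (sighup) or a full server restart (postmaster).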
Cluster
In PostgreSQL, the term "cluster" refers to a collection of databases managed
by a single PostgreSQL server instance. This concept is slightly different from
what some might expect, especially if they're familiar with clustering in the
context of high availability or distributed databases. Here's a breakdown of
what a cluster means in PostgreSQL:
Server Instance: A single running process of PostgreSQL, which can
manage multiple databases.
Databases: Within a cluster, PostgreSQL can host multiple isolated
databases. These databases are separate from each other, meaning
that, by default, data and user roles are not shared across databases.
Data Directory: Every cluster has its own data directory, where all the
data files, including tables, indexes, and configuration files, are stored.
This directory is specified when the PostgreSQL server is initialized
(using initdb ) and is unique to that cluster.
Preparing the new cluster
1. Before creating a cluster, it is good practice to define PGDATA (the
location of the data directory)
2. Add the path to the PostgreSQL binaries to the PATH environment variable
3. Create the directories:
1. One for PGDATA
2. One for the WAL files
4. Grant the postgres user permissions to the created directories
5. Use initdb to create the new cluster
6. Create the service in systemd
Check initdb --help
initdb -D /postgres/data -X /postgres/wal -E UTF8 -W
systemd - service
Reloading configuration
systemctl daemon-reload
systemctl start postgresql
systemctl status postgresql
Configuration of remote access
The pg_hba.conf file is a crucial configuration file in PostgreSQL, responsible
for controlling client authentication. "HBA" stands for Host-Based
Authentication.
This file specifies which users can connect to which databases, from which
hosts, and using which authentication methods. By editing pg_hba.conf,
database administrators can enforce security policies and manage access
controls in a fine-grained manner.
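Illustrative pg_hba.conf lines (the addresses, database and user names are examples only; the file is re-read after a reload, e.g. SELECT pg_reload_conf();):
# TYPE   DATABASE   USER    ADDRESS           METHOD
local    all        all                       peer
host     all        all     127.0.0.1/32      scram-sha-256
host     test1      kuba    192.168.1.0/24    scram-sha-256
Entries are matched from top to bottom; the first line that matches the connection decides the authentication method.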
psql
psql is the interactive terminal client for PostgreSQL, providing a command-
line interface for interacting with the PostgreSQL database server. It allows
users to execute SQL queries, run scripts, and manage both the data and the
schema of a database directly from the terminal. psql is included with
PostgreSQL installation packages and is widely used by database
administrators and developers for its flexibility and full-featured support for
PostgreSQL's capabilities.
Key Features of psql
Interactive Commands: psql supports a wide range of interactive
commands that allow users to navigate databases, view table data, and
manage database objects.
Script Execution: Users can execute SQL script files from within psql
using commands like \i filename.sql , facilitating automation and
batch processing.
Metadata Queries: It provides commands for listing database objects
(databases, tables, indices, etc.) and viewing object information, which
are helpful for database exploration and administration.
Transaction Control: psql allows users to manually manage
transactions with SQL commands ( BEGIN , COMMIT , ROLLBACK ), giving
precise control over database operations.
Customizable Environment: Users can customize the behavior of psql
through various settings, such as output formatting options, and can use
.psqlrc file to apply their customizations automatically upon startup.
Extensive Help System: psql includes an integrated help system,
accessible via the \h (for SQL command help) and \? (for psql
command help) commands, making it easier to use even for newcomers.
Connecting to cluster
psql -h <hostname> -p <port> -U <user> -d <database>
We can also use environment variables.
PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD
psql -U postgres
create database test1;
create database test2;
create user kuba with password 'kuba';
\du
psql -U kuba -d test1
Help
Commands in psql can be divided into SQL commands and meta-commands.
\? - display help
Display list of databases
\l
or
\l+
Display list of tables
\dt
Running SQL command
psql -U postgres -c "select current_time"
psql -U postgres -c "select current_time;" -f
C:\Users\Piotr\Desktop\skrypt.sql -c "select current_time;„
psql -U postgres -c "select current_time;" -f
C:\Users\Piotr\Desktop\skrypt.sql -c "select current_time;„ -L
load_table.log
psql -U postgres -c "select current_time;" -f
C:\Users\Piotr\Desktop\skrypt.sql -c "select current_time;„ -o
load_table.log
Switching databases
\c database_name
\conninfo
\q
Watch
select * from pg_stat_activity;
\watch
Extended display
select * from pg_stat_activity;
\x
select * from pg_stat_activity;
Display help about command
\h UPDATE
Run command from the file
\i path_to_file
Execution time
Displays execution time for a command
\timing
Service file
The service file makes it possible to automate, at the OS level or per user,
the process of connecting with psql.
At the OS level:
/etc/pg_service.conf
At the level of the OS user account (the location can be changed with
PGSERVICEFILE):
~/.pg_service.conf
The user-level entry takes precedence!
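An illustrative ~/.pg_service.conf entry (the service name and connection details are examples):
[test1_local]
host=localhost
port=5432
dbname=test1
user=kuba
psql service=test1_local then connects using these settings.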
Logical structure of the database
In PostgreSQL, a template database serves as a blueprint for creating new
databases.
PostgreSQL comes with two default template databases: template0 and
template1. When you create a new database, PostgreSQL clones the
template database you specify (or template1 by default) to create the new
database. This mechanism allows for the new database to inherit not just the
schema (tables, data types, functions, etc.) but also the data from the
template database if it contains any.
template1
Default Template: template1 is the default template database.
Whenever you create a new database without specifying a template,
PostgreSQL uses template1 .
Customizable: Users can modify template1 , such as by adding
common extensions or user-defined data types, so that every new
database created from it will inherit those modifications. However,
because it is the default template, it's important to be cautious about the
changes made to it.
template0
Fallback Template: template0 is a clean, unmodified template. It exists
as a fallback solution so you can create a new database even if
template1 has been changed in such a way that it's no longer usable
for creating new databases.
Immutable: It is recommended to keep template0 unchanged to
ensure that there is always a stable template from which to create new
databases.
Working with template database - superuser
Only a superuser can use another (non-template) database as a template.
create database test1;
\c test1 postgres;
create table student(id int);
\c postgres postgres
create database clone_test1 template test1;
Working with template database - user
drop database clone_test1;
alter user jacek createdb;
\c postgres jacek
create database clone_test1 template test1;
create database clone_test1 template template0;
\c postgres postgres
alter user jacek nocreatedb;
Showing list of schemas
\dn
Creating a new schema
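For example (the schema and role names are illustrative):
create schema sales;
create schema hr authorization kuba;   -- schema owned by the role kuba
\dn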
Creating users/roles
Display users
\du
\h create user
\h create role
Create user michal with password 'michal' inherit;
Create role devteam;
\du
Alter role devteam createdb;
Grant devteam to michal;
Create database baza_michala;
Set role devteam;
Create database baza_michala;
\c postgres postgres
We will talk about permissions in detail later.
Bibliography
- Official documentation (https://www.postgresql.org/docs/)
- PostgreSQL 11 Administration Cookbook, Simon Riggs, Gianni Ciolli
- PostgreSQL Administration Essentials
- Additional materials - Łukasz Bartnicki, Lucian Oprea
- The Internals of PostgreSQL - http://www.interdb.jp/pg/pgsql01.html
- https://www.postgresql.fastware.com/blog/back-to-basics-with-postgresql-memory-components
- https://archive.org/details/pgza18-Wait_Events_in_PostgreSQL