Yann Neuhaus
Mastering SSIS with Excel – Resolving the ‘Microsoft.ACE.OLEDB.16.0 Provider is Not Registered’ Error
Dear Readers,
Recently, I encountered the ‘Microsoft.ACE.OLEDB.16.0 Provider is Not Registered’ error while working on a data extraction task using SSIS. In this article, I’ll share how to fix this issue and also guide you through:
- Migrating SSIS packages from SQL Server 2017 to SQL Server 2022 using Visual Studio 2022.
- Deploying SSIS packages to the SSIS Catalog.
- Installing the 32-bit driver when the 64-bit version is already present.
In a recent project, I had to migrate SSIS packages from SQL Server 2017 to SQL Server 2022. Many of these packages used Excel as a data source. Here’s the environment setup:
- Development Environment: Windows Server 2022 with Visual Studio 2022 installed.
- Production Environment: Windows Server 2022 with SQL Server 2022 and SSMS (SQL Server Management Studio) 20.2.
Migrating these packages was straightforward using Visual Studio 2022. Below are some key steps:
2.1 SSIS Catalog Creation
Using SQL Server in the production environment, I created the SSIS Catalog.
Note: The ‘Create Catalog’ option is disabled because each SQL Server instance can have only one SSIS Catalog, and the instance used in this demo already contains an existing SSIS Catalog.
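If you want to double-check this from T-SQL, the presence of an SSIS Catalog can be confirmed as sketched below (the catalog database is named SSISDB by default):
-- Returns a row if the SSIS Catalog database exists on the instance
SELECT name FROM sys.databases WHERE name = N'SSISDB';
-- Lists the catalog properties once the catalog is in place
SELECT property_name, property_value FROM SSISDB.catalog.catalog_properties;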
2.2 Setting up Visual Studio
I installed the SQL Server Integration Services Projects 2022 extension in Visual Studio 2022.
2.3 SSIS Packages Import
Using Visual Studio 2022’s import wizard, I imported SSIS packages from SQL Server 2017. This wizard simplifies the process by extracting the packages and creating a local project.
2.4 Deploying Packages
After the migration, I used the SSIS Project Deployment feature to deploy all the packages into the new SSIS Catalog.
3 Problem Statement
After deploying the packages to the SSIS Catalog, executing them resulted in the error: ‘Microsoft.ACE.OLEDB.16.0 provider is not registered’
This issue also appeared when attempting to import Excel data through SQL Server Management Studio, where I got the same error message.
4 Root Cause Analysis
This error occurs because the Access Connectivity Engine (ACE) OLEDB provider required to connect to Excel files is not installed.
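To verify this on a given server, the OLE DB providers registered with SQL Server can be listed as shown in the sketch below (run it on the machine where the packages execute):
-- 'Microsoft.ACE.OLEDB.16.0' should appear in this list once the Access Database Engine is installed
EXEC master.dbo.sp_enum_oledb_providers;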
5 Solution
To resolve this issue, I followed these steps:
5.1 Install the 64-bit Microsoft Access Database Engine 2016 Redistributable
Download the Microsoft Access Database Engine 2016 Redistributable 64-bit and install it on both servers.
5.2 Install the 32-bit Microsoft Access Database Engine 2016 Redistributable
Download the Microsoft Access Database Engine 2016 Redistributable 32-bit if necessary.
This is required for 32-bit applications like SQL Server Management Studio’s Excel import feature.
The installation of the 32-bit version failed with the error message below because the 64-bit version was already installed:
To bypass this, use the Command Prompt to force the installation:
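A minimal sketch of that command, assuming the downloaded 32-bit installer is named AccessDatabaseEngine.exe and the Command Prompt is opened in its folder (the /quiet switch performs a silent installation and bypasses the check that blocks mixing 32-bit and 64-bit versions):
AccessDatabaseEngine.exe /quiet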
5.3 Install Additional Components (depending on your Excel version)
Download and install other components depending on the Excel version used. Details are provided in the picture below:
6 Conclusion
Connecting to Excel files using SSIS requires the ACE OLEDB provider. By installing the correct Microsoft Access Database Engine, you can fix the ‘Microsoft.ACE.OLEDB.16.0 provider is not registered’ error.
Additionally, I walked you through migrating SSIS packages from SQL Server 2017 to SQL Server 2022, deploying them to the SSIS Catalog, and force-installing a 32-bit driver when the 64-bit version is already installed.
Stay tuned for future insights and tips to avoid similar challenges!
The article Mastering SSIS with Excel – Resolving the ‘Microsoft.ACE.OLEDB.16.0 Provider is Not Registered’ Error appeared first on dbi Blog.
2024.pgconf.eu – Bigger than ever, the family is growing fast
The PostgreSQL community is growing every year, and the same happens for pgconf.eu, which took place in Athens, Greece, this year. Attending this event is one of the best opportunities to get in touch with the people behind PostgreSQL. This year we had 778 attendees, be it PostgreSQL hackers, users, developers, organizers, or anyone else interested in PostgreSQL in one way or another. While I am not going to give you any information about the sessions I’ve attended (you can check that for yourself on the conference website), I’d like to tell you why attending PostgreSQL conferences is such a great experience.
First of all, this conference is about people. Here I meet people whom I haven’t seen since last year, or even longer. But as soon as we meet again, we’re one big family sharing an interest in free and open source software, especially PostgreSQL. What makes PostgreSQL conferences such a great place to be is that there is no marketing. Of course there are sponsors, and you may get in touch with them to check what services they offer, but nobody will come to you directly to sell you something. Of course there are competitors, but you don’t feel that, because we’re one big group of people having the same goal: making PostgreSQL the best open source database in the world, making PostgreSQL even more successful, and we all can be part of that.
Especially pgconf.eu is also a pool of local PostgreSQL user groups coming together. Usually, and this year was no exception, we have a SwissPUG dinner on the Thursday of the conference. Other local user groups do the same, and when those events are over, most of the people continue discussing at either the conference hotel’s bar or somewhere else in the city. Given the size of the conference, there are people from all over the world, a lot of different cultures, opinions, and ways of living. This mix is the second reason why this is the place to be in October.
Finally, it is about the content. There are sessions which deep dive into PostgreSQL internals, there are sessions around the PostgreSQL community which are not technical at all, and there are sessions for beginners and for long-term PostgreSQL users. There is something for everybody. But this is not all, as there are also plenty of technical discussions around the talks. You want to know something about the JDBC driver? Go and talk to the people who are working on the driver. You want to know something about PostgreSQL partitioning? Go and talk to the people who either did it or are very experienced with it. There are answers to every question around PostgreSQL; you just need to either know who to talk to, or ask someone who can connect you with the right people.
Last but not least, the community is always looking for improvements:
See you all next year in? … and, by the way:
The article 2024.pgconf.eu – Bigger than ever, the family is growing fast appeared first on dbi Blog.
PostgreSQL Conference Europe 2024 – Day 2
Here is day 2 of the PostgreSQL Conference Europe 2024 in Athens; if you haven’t read about the first day, here is the link.
After a fantastic first day, I was excited to dive into the next day of the conference, which offered a mix of technical sessions that covered essential topics like upgrading PostgreSQL, handling statistics, and even taking DBA skills to the cloud.
Here are some of the highlights from the sessions I attended:
An Ultimate Guide to Upgrading Your PostgreSQL Installation
The day began with a thorough guide on PostgreSQL upgrades, focusing on best practices and critical steps to ensure a smooth upgrade process. One key takeaway was the need for logical replication during upgrades, emphasizing the requirement for primary keys on each table. The session underscored the importance of careful schema design, especially when aiming for minimal downtime during major version upgrades. This talk was very close to the subject I proposed and that I will blog about soon (Architecture for Seamless PostgreSQL Upgrades). The way upgrade best practices were presented by Ilya Kosmodemiansky was very straightforward, whereas I had taken the more complicated approach of explaining how logical replication is difficult and why you shouldn’t put yourself in the spot of having to rely on it. Ilya’s presentation helped me a lot as a speaker.
A Deep Dive into Postgres Statistics
This session took us into the inner workings of PostgreSQL’s statistics collection. Understanding how PostgreSQL gathers and utilizes statistics is fundamental for optimizing query performance, and this talk delivered just that. I really enjoyed the talk’s host, Louise Grandjonc; she brought an infectious enthusiasm that resonated with the audience (the database she chose was about medical studies). Everyone could feel that she is passionate about PostgreSQL, and her view on statistics also offered some nice insights into the reasons for certain behaviors, by looking at the source code and doing the math yourself.
Porting On-Prem Performance Troubleshooting Skills to the Cloud
I was very interested in the subject of this talk from Denzil Ribeiro (Microsoft); I didn’t expect any big announcement other than some additional Postgres features being allowed into the Azure cloud. My mistake! This was huge for PostgreSQL Azure users and DBAs seeking performance tuning tools in the cloud, because it announced the first iteration of a Query Store for PostgreSQL! For those who know and love this feature on SQL Server, this is huge, and having Microsoft port such a feature into the PostgreSQL world is a big step for them, even though it is Azure cloud-based and won’t be available in the PostgreSQL open source project… To me, this is actually quite fair on their part, but it also means future improvements to auto_explain capabilities from Microsoft’s side to support this feature, which is going to be full of possibilities for performance tuners and other future open-source projects. For me, this will also be good for other on-premise products like PGanalyze. Denzil also announced future automation in a query tuning advisor to complement this, using LLM-powered functionalities.
Demystifying Kubernetes for Postgres DBAs: A Guide to Operators
The final session of the day brought clarity to the complexities of managing PostgreSQL on Kubernetes. The speaker presented a guide to PostgreSQL operators in Kubernetes, explaining how operators can simplify database management in containerized environments. For those of us comfortable with on-prem or virtualized setups, it was a great introduction to how Kubernetes can support PostgreSQL with scalability and reliability. This talk helped demystify the Kubernetes landscape for DBAs, showing how operators can bridge the gap between traditional database administration and cloud-native practices.
Day two of the conference was packed with practical knowledge, from upgrading strategies to adapting cloud skills and navigating Kubernetes for PostgreSQL. Alongside the talks, discussing with speakers and committers of the project, I found myself already part of this community, which was and is very motivating!
Like any PostgreSQL conference, it was the occasion for me to get some new stickers and the goodies provided by the sponsors. Usually, I am not so much a fan of these things, but I couldn’t get past the conference t-shirt offered by pgconf.eu and the Microsoft Citus socks!
The article PostgreSQL Conference Europe 2024 – Day 2 appeared first on dbi Blog.
PostgreSQL Conference Europe 2024 – Day 1
Attending PostgreSQL Conference Europe 2024 in Athens was an incredible experience, full of enriching conversations and technical deep dives. As a DBA Consultant and speaker (on the reserve list, and honored to have been chosen), it was my first time meeting this amazing community in person, and I was immediately struck by the camaraderie and shared passion for PostgreSQL.
The atmosphere was welcoming, with everyone — from those just starting their PostgreSQL journey to seasoned veterans — eager to exchange knowledge. I even had the pleasure of meeting people whose blogs and posts I’ve followed for years, like Laurenz Albe. These interactions made me realize how vibrant and collaborative this community is.
Choosing which talks to attend was quite tricky for me, as I wanted to attend all of them.
You might want to check the schedule yourself to get an idea:
Schedule — PostgreSQL Conference Europe 2024
Opening & Keynote: PostgreSQL and Licensing
The conference kicked off with a keynote that touched on the licensing structure of PostgreSQL. It was an insightful reminder of how PostgreSQL remains free and open-source, empowering countless organizations worldwide to run enterprise-grade databases without the burden of licensing fees, and it expanded on the fact that the real fee is the time and effort of the contributors and the community.
Cybertec: High Concurrency & Distributed Snapshots
This session dived into Cybertec’s research on improving high-concurrency workloads by implementing distributed snapshots. The idea of using a Commit Sequence Number system, similar to what Yugabyte uses with hybrid time, was particularly intriguing. I have to admit that I didn’t get most of the talk, as this was part of the internal track, but it was very interesting to see the efforts being made by committers and the ongoing improvements in making read and write I/O capabilities even better.
Laurenz Albe: Security Attacks on PostgreSQL
Laurenz’s talk on security attacks was one of my most anticipated sessions. He covered practical techniques to defend against common security vulnerabilities in PostgreSQL. His advice on setting security_barrier = ON
for views, properly configuring security definer functions, and the potential abuse of search paths was invaluable. It reinforced how critical it is to consider these security measures when architecting secure PostgreSQL solutions. I myself have had a blog post on this pending for some weeks now, but the talk motivated me to publish it with some new ideas.
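As a small illustration of the first point, this is roughly what a security-barrier view looks like in SQL (a sketch with hypothetical table and view names, not taken from the talk):
-- The security_barrier option prevents functions in queries against the view
-- from being pushed down below the view's own WHERE clause.
CREATE VIEW customer_public WITH (security_barrier = true) AS
    SELECT id, name
    FROM customer
    WHERE NOT is_confidential;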
Michael Bank: Patroni Deployment Patterns
As someone involved in high-availability setups, the session on Patroni deployment patterns by Michael Bank was a must-see. Michael explored various ways to implement Patroni in different environments, focusing on best practices and real-world examples. It provided a deeper understanding of how to fine-tune Patroni architectures in production environments, a topic that resonates with my ongoing work.
Social event
In the evening there was a social event where drinks and food were served; it was the occasion to meet even more people and speak all the languages you can…
It was the occasion for some to dress up:
The first day of the conference offered not only technical depth but also meaningful exchanges with fellow DBAs. The talks, combined with discussions, provided a perfect blend of learning and sharing.
Looking forward to what the rest of the conference holds!
The article PostgreSQL Conference Europe 2024 – Day 1 appeared first on dbi Blog.
M-Files Regional Partner 2024 (Geneva)
Last week, I had the pleasure of being invited to the M-Files Regional Partner event in Geneva.
M-Files people rallied around us:
Tero Antila – Director EMEA Partner Sales
Helinã Mellanen – Senior Enablement Manager
Emilie Feat – Senior Partner Manager Switzerland
Sebastian Wegener – Senior Partner Manager Germany & Austria
Joachim Bleicher – Senior Partner Sales Engineer
Mustapha Laouari – Sales Engineer
After a great morning coffee, a few M-Files business overviews (goals and future strategy plans), and some networking, I decided to follow the different technical tracks presented by Joachim and Mustapha.
New M-Files Desktop
If you are quite familiar with M-Files, you have most probably used the current M-Files Desktop client:
The objective of the new M-Files Desktop is to streamline the different client interfaces you use to access your vault documents, such as the classic and new M-Files Web clients or M-Files Mobile. Currently, those client UIs may lead to some confusion for end users and can be frustrating in terms of functionality and/or interactivity limitations. The new client is built on modern web technologies and has a fresh look and feel. A single code base is used for the new M-Files Desktop and the M-Files Web client.
Already available in October 2024, it should speed up access to new capabilities with its updated design, such as the Aino tab (an assistant in M-Files that helps users find information), an improved filter view, and easier vault navigation. The “M:” drive is no longer used. Even if the classic desktop client remains the default for now, the transition should minimize disruption, and users will still have the possibility to toggle to and from the new client version.
New capabilities will come in the near future:
– Create view
– Home screen
– Drag & drop to Windows
– Visitor links
… and much more
A last technical word: in order to make this operational, you will have to enable the gRPC connection protocol on your M-Files server (something you do not have to take care of if you use the M-Files cloud solution, where it is ready for use). Also, consider the points below while operating the transition:
- UI extension applications require the new application framework, UIX Framework 2.0
- Compliance Kit compatibility: a new CK version will be released which will be compatible with the new client (including object creator and version control modules (most used modules))
Here is the current Outlook context/situation:
Today, you can save all your important emails to one safe M-Files space: apply metadata (only relevant metadata needs to be filled in), save attachments, and even automate filling with customizable Outlook folder rules through the Outlook interface. Now it seems that more and more features will come:
Cloud update
No doubt M-Files Cloud is taking up more and more room: saving time to value, with less effort, while keeping security, compliance, and high availability.
An M-Files trial is now available for partners (partner program fee). It includes cloud hosting on Standard Cloud as well as M-Files Aino with a 1GB semantic index. Check for additional services as needed:
M-Files Manage and Identity Management
A quick review of the concept of identity management in M-Files:
- User provisioning
- Authentication (Windows, M-Files and Federated authentication)
- M-Files Manage
As you may know, vault users are always mapped to “login accounts”.
In the cloud, M-Files Manage is used to add users manually, with Microsoft Entra ID preferred together with the M-Files Login Service (an out-of-the-box, easy option).
The latest M-Files Manage updates allow copying existing template vaults directly in M-Files Cloud across two different subscriptions. An application account can be used to get access to M-Files.
Moreover, cloud vault deletion and mass operations (mass license/user changes/operations) are coming soon.
M-Files Aino & Copilot connector
M-Files Aino (a cloud service requiring a connection to M-Files Cloud) is an AI assistant that can summarize documents and give answers about document contents. Connectivity, confidentiality, and accuracy are the elements of this successful generative AI solution.
Ask something and get answers in a particular context (vault or properties metadata-driven context)
… even with documents in other languages.
Aino can be used with any text-based document. Behind the scenes, an additional module, “Aino Usage: Semantic Index”, needs to be installed. Also consider carefully the Aino indexer configuration settings and the index storage size, which is directly proportional to the amount of text in (only active) documents.
The M365 Copilot integration (M-Files connector for Copilot, included in the Business Platform Edition but not in Business LE) ensures that content in M-Files is available to M365 Copilot users. It can be used independently of Aino.
With this Connector for Copilot, users can access all the content in M-Files with the same Copilot user interface. It publishes selected information from M-Files to the Microsoft 365 platform so that Copilot can refer to it when doing its work (the information retains its permissions from M-Files, i.e., the same M-Files read permission access). Clicking a referenced document takes the user right to the document in M-Files.
Update on Collaboration possibilities
- Teams
As a reminder, external users can access the Teams channel, but only internal users can see what is in M-Files through Teams (concurrent user license). Here’s the roadmap:
Some nice features and improvements are to come, such as storing and accessing Teams channel posts’ attachments.
- Hubshare
M-Files Hubshare is a full web application developed in C# and updated monthly. Additionally, a new version of the M-Files Plugin for Hubshare is published every 3 months. Below is a summary of the 2024 highlights.
Some objectives and improvements coming for Hubshare:
Conclusion
As you can see, there is a lot going on, and I hope you found out, or (re-)discovered, how rich and lively the M-Files ecosystem is. If you plan to use, improve, or discover these tools in your current and future business process integrations, please do not hesitate to contact us for any support.
The article M-Files Regional Partner 2024 (Geneva) appeared first on dbi Blog.
PostgreSQL 17: Enhancing JSON Support for Web Developers
PostgreSQL 17 introduces new features for working with JSON data. These features align PostgreSQL more closely with the SQL/JSON standard and improve the developer experience when dealing with semi-structured data. In this blog, we will explore the new JSON capabilities by walking through real-world examples using a table with a jsonb
column.
When building modern web applications, it’s common to handle JSON data—whether it’s from API responses, user data, or configuration files. PostgreSQL has supported JSON for years, and with the release of PostgreSQL 17, working with JSON becomes even more streamlined.
Let’s start by creating a table to store user data in a jsonb
column and use it to demonstrate the new features.
Creating a Table with a jsonb Column
To demonstrate PostgreSQL 17’s new features, let’s create a simple table called users
that stores user information in a jsonb
column.
postgres=# CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
profile JSONB
);
CREATE TABLE
This table has three columns:
- id: A unique identifier for each user.
- name: The user’s name.
- profile: A jsonb column that stores various information about the user, such as their age, preferences, and settings.
Now, we’ll insert some user data in JSON format into the profile
column.
postgres=# INSERT INTO users (name, profile)
VALUES
('John Doe', '{"age": 30, "preferences": {"newsletter": true, "theme": "dark"}}'),
('Jane Smith', '{"age": 25, "preferences": {"newsletter": false, "theme": "light"}}'),
('Alice Brown', '{"age": 35, "preferences": {"newsletter": true, "theme": "dark"}, "address": {"city": "Paris", "country": "France"}}');
INSERT 0 3
We now have three users with different JSON profiles, each containing details such as age, preferences for newsletters and themes, and, in Alice’s case, an address.
postgres=# select * from users;
id | name | profile
----+-------------+----------------------------------------------------------------------------------------------------------------------
1 | John Doe | {"age": 30, "preferences": {"theme": "dark", "newsletter": true}}
2 | Jane Smith | {"age": 25, "preferences": {"theme": "light", "newsletter": false}}
3 | Alice Brown | {"age": 35, "address": {"city": "Paris", "country": "France"}, "preferences": {"theme": "dark", "newsletter": true}}
(3 rows)
With the table set up and populated, let’s explore the new JSON-related features in PostgreSQL 17, such as JSON_TABLE
, SQL/JSON query functions, and enhanced jsonpath
expressions.
JSON_TABLE: Converting JSON into Tabular Format
The JSON_TABLE
function allows us to transform our jsonb
data into rows and columns. This is particularly useful when we want to extract structured data from JSON documents stored in a relational database.
Let’s extract the age
and theme
from the profile
column and convert it into a tabular format:
postgres=# SELECT *
FROM JSON_TABLE(
(SELECT profile FROM users WHERE name = 'Alice Brown'),
'$.preferences' COLUMNS (theme TEXT PATH '$.theme', newsletter BOOLEAN PATH '$.newsletter')
) AS jt;
theme | newsletter
-------+------------
dark | t
(1 row)
Here, we extracted Alice Brown’s theme preference and whether she subscribes to the newsletter from her profile. The $ symbol is essential for this operation.
Here’s a breakdown of how this works:
- $.preferences: This part refers to the preferences key at the root of the JSON document. Once this key is selected, it acts as the root for the columns inside the preferences object.
- Inside the COLUMNS clause:
  - $.theme: Here, $ refers to the root of the preferences object, not the root of the entire JSON document. So, it looks for the theme key inside the preferences object.
  - $.newsletter: Similarly, $ here refers to the root of the preferences object, and it looks for the newsletter key within that object.
In this context, the $ in the COLUMNS clause is “relative” to the object you are working with, which in this case is preferences. This is a key concept in using the JSON_TABLE function and jsonpath expressions in PostgreSQL—$ adapts based on the context of the object you’re working with at that stage of the query.
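As a small variation on the example above (a sketch using the same users table), JSON_TABLE can also be applied to every row at once by referencing the jsonb column directly in the FROM clause, which turns each user’s preferences into regular columns:
-- Expand each user's preferences object into the columns theme and newsletter.
SELECT u.name, jt.theme, jt.newsletter
FROM users u,
     JSON_TABLE(
       u.profile,
       '$.preferences' COLUMNS (theme TEXT PATH '$.theme', newsletter BOOLEAN PATH '$.newsletter')
     ) AS jt;
-- Based on the rows inserted above: John Doe (dark, t), Jane Smith (light, f), Alice Brown (dark, t).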
PostgreSQL allows you to create JSON directly from SQL expressions, making it easier to work with JSON data dynamically. This function exists since PostgreSQL 12, but I believe that it makes sense to present it here.
Let’s construct some JSON data based on our users
table:
postgres=# SELECT name,
jsonb_build_object(
'age', profile->>'age',
'theme', profile->'preferences'->>'theme'
) AS constructed_json
FROM users;
name | constructed_json
-------------+---------------------------------
John Doe | {"age": "30", "theme": "dark"}
Jane Smith | {"age": "25", "theme": "light"}
Alice Brown | {"age": "35", "theme": "dark"}
(3 rows)
This query dynamically builds a JSON object from the age
and theme
fields in the profile
column.
PostgreSQL 17 introduces several new SQL/JSON query functions, such as JSON_EXISTS
, JSON_QUERY
, and JSON_VALUE
. These functions allow you to query and extract values from JSON documents more efficiently.
Let’s check if the address
key exists in John Doe’s profile:
postgres=# SELECT JSON_EXISTS(profile, '$.address')
FROM users
WHERE name = 'John Doe';
json_exists
-------------
f
(1 row)
We can use the JSON_VALUE
function to extract specific values from a JSON document. For example, let’s extract the city
from Alice’s address:
postgres=# SELECT JSON_VALUE(profile, '$.address.city')
FROM users
WHERE name = 'Alice Brown';
json_value
------------
Paris
(1 row)
The JSON_VALUE
function simplifies the process of extracting individual scalar values from JSON documents. Use JSON_VALUE()
only when you expect the extracted value to be a single SQL/JSON scalar. Attempting to retrieve multiple values will result in an error. If the extracted value might be an object or an array, opt for the JSON_QUERY
function instead.
The JSON_QUERY
function in PostgreSQL provides a powerful way to extract data from JSONB columns using path expressions. By leveraging its various options, developers can customize the output format, handle errors gracefully, and work efficiently with JSON data stored in PostgreSQL. Now, let’s use JSON_QUERY
to extract the preferences for each user. We want to get the preferences
object for each user’s profile.
postgres=# SELECT
name,
JSON_QUERY(profile, '$.preferences') AS user_preferences
FROM users;
name | user_preferences
-------------+-----------------------------------------
John Doe | {"theme": "dark", "newsletter": true}
Jane Smith | {"theme": "light", "newsletter": false}
Alice Brown | {"theme": "dark", "newsletter": true}
(3 rows)
Enhanced jsonpath Expressions
PostgreSQL improves its support for jsonpath
expressions, enabling more advanced queries. You can now cast JSON values into native PostgreSQL types, such as integers or booleans.
Let’s extract Jane Smith’s age
and cast it as an integer:
postgres=# SELECT JSON_VALUE(profile, '$.age' RETURNING INT) AS age_int
FROM users
WHERE name = 'Jane Smith';
age_int
---------
25
(1 row)
This query demonstrates how you can convert JSON data into a native PostgreSQL type using the RETURNING
clause.
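A related refinement, shown here as a sketch rather than taken from the original examples: the RETURNING clause can be combined with an ON EMPTY handler to control what is returned when the path matches nothing, as for John Doe, whose profile has no address:
-- Without DEFAULT ... ON EMPTY, this would simply return NULL for John Doe.
SELECT JSON_VALUE(profile, '$.address.city' RETURNING TEXT DEFAULT 'unknown' ON EMPTY) AS city
FROM users
WHERE name = 'John Doe';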
PostgreSQL 17 brings powerful new features for working with JSON data, making it easier for web developers to query and manipulate JSON in a database. We explored how JSON_TABLE
, new SQL/JSON functions, and enhanced jsonpath
expressions help convert JSON data into a more usable format, extract values, and even handle complex queries.
These new tools make working with semi-structured data simpler, allowing you to build more efficient and flexible web applications. If you often work with JSON in PostgreSQL, upgrading to version 17 will streamline your workflow and enhance your capabilities.
Stay tuned, and don’t forget to check the PostgreSQL 17 release notes: https://www.postgresql.org/about/news/postgresql-17-released-2936/
The article PostgreSQL 17: Enhancing JSON Support for Web Developers appeared first on dbi Blog.
Improved not null/null planning in PostgreSQL 17
PostgreSQL 17 came with quite some additional features for the optimizer. One of them is about “is null” and “is not null” handling when a column has a “not null” constraint defined. In the past, it was common to simply not add an “is not null” to a query when you were sure anyway that there cannot be any nulls because of the “not null” constraint. If you did it anyway, there was some overhead because the “is not null” had to be evaluated. This changed with PostgreSQL 17, as the optimizer got smarter for such cases.
This can easily be demonstrated by creating a simple table with one column having a not null constraint:
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 16.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
postgres=# create table t ( a int not null );
CREATE TABLE
postgres=# insert into t select * from generate_series(1,1000000);
INSERT 0 1000000
postgres=# analyze t;
ANALYZE
… and the same in version 17:
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 17.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
postgres=# create table t ( a int not null );
CREATE TABLE
postgres=# insert into t select * from generate_series(1,1000000);
INSERT 0 1000000
postgres=# analyze t;
ANALYZE
If we select from that table in version 16 by asking for all the rows which are not null, we’ll see this:
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 16.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
postgres=# explain select * from t where a is not null;
QUERY PLAN
-----------------------------------------------------------
Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
Filter: (a IS NOT NULL)
(2 rows)
The filter (a IS NOT NULL) is evaluated even though we know it applies to all the rows because of the not null constraint. Doing the same in version 17 looks like this:
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 17.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
postgres=# explain select * from t where a is not null;
QUERY PLAN
-----------------------------------------------------------
Seq Scan on t (cost=0.00..14425.00 rows=1000000 width=4)
(1 row)
The filter is gone, which saves the overhead of evaluating it. The other case is asking for all rows which are null, which obviously cannot return anything because the constraint does not allow nulls.
PostgreSQL 16 will do a parallel sequential scan over the whole table:
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 16.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
postgres=# explain select * from t where a is null;
QUERY PLAN
-------------------------------------------------------------------
Gather (cost=1000.00..9591.77 rows=1 width=4)
Workers Planned: 2
-> Parallel Seq Scan on t (cost=0.00..8591.67 rows=1 width=4)
Filter: (a IS NULL)
(4 rows)
PostgreSQL 17 will not do that anymore:
postgres=# select version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 17.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
postgres=# explain select * from t where a is null;
QUERY PLAN
------------------------------------------
Result (cost=0.00..0.00 rows=0 width=0)
One-Time Filter: false
(2 rows)
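To confirm that this really hinges on the constraint, here is a quick counter-check (a sketch against the same table t, not part of the original demonstration): once the constraint is dropped, the planner can no longer prove anything, and the plan should again contain the “Filter: (a IS NOT NULL)” line, as in version 16:
postgres=# alter table t alter column a drop not null;
ALTER TABLE
postgres=# explain select * from t where a is not null;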
This seems obvious, but versions before 17 cannot do it. A really nice improvement, and this alone should make you move to PostgreSQL 17 as soon as possible.
The article Improved not null/null planning in PostgreSQL 17 appeared first on dbi Blog.
Creating a Global and Custom Snackbar in Vue 3 Using Vuetify and Pinia
In this post, we’ll walk through how to implement a Global and Custom Snackbar in Vue 3 with Vuetify and Pinia for state management. We’ll set it up so that you can easily trigger the Snackbar from any component in your app.
I. Set Up Your Vue App with Vuetify CLI
To start, we’ll create our Vue 3 app using the Vuetify CLI. This CLI simplifies the process of integrating Vuetify and sets up your project structure.
npm create vuetify@latest
Follow the prompts to choose your project settings and install dependencies. As we are going to use Pinia, don’t forget to choose this option. I decided to keep all the generated files to keep it simple and straightforward, but you can clean up the code that was automatically generated. I just cleaned the component HelloWorld.vue, which now looks like this:
<template>
<v-container class="fill-height">
<v-responsive class="align-center fill-height mx-auto" max-width="900">
</v-responsive>
</v-container>
</template>
<script setup lang="ts">
</script>
Next, create a Pinia store (SnackbarStore.ts
) to handle the state of the Snackbar (open/close, message, status, etc.). This will allow you to trigger the Snackbar from anywhere in the app.
// SnackbarStore.ts inside stores directory
import { defineStore } from 'pinia'
import { Ref, ref } from 'vue'
export interface CustomAction {
actionName: string;
link: any;
}
export const useSnackbarStore = defineStore('SnackbarStore', () => {
const isOpen = ref(false)
const message = ref('')
const status = ref('error')
const customActions: Ref<CustomAction[]> = ref([])
const showSnackbar = (msg: string, st: string, optionalActions?: CustomAction[]) => { // Change to accept an array
message.value = msg
status.value = st
isOpen.value = true
if (optionalActions) {
customActions.value.push(...optionalActions) // Spread the array into customActions
}
if (status.value.toLowerCase() === 'success') {
setTimeout(() => {
closeSnackbar()
}, 2500)
}
}
function closeSnackbar () {
isOpen.value = false
customActions.value.splice(0, customActions.value.length) // Clear custom actions
}
return { isOpen, message, status, showSnackbar, closeSnackbar, customActions }
})
We’ll create a reusable Snackbar component (GlobalSnackbar.vue
) that can be used anywhere in your application.
<template>
<div class="text-center">
<v-snackbar
v-model="snackbarRef"
class="customSnackbar"
:color="snackbarStatus"
variant="outlined"
vertical
>
<div class="text-subtitle-1 pb-2 title-custom">{{ snackbarStatus }}</div>
<p>{{ errorMsg }}</p>
<template #actions>
<v-btn
v-for="act of snackbarActions"
:key="act.actionName"
:to="act.link"
>
{{ act.actionName }}
</v-btn>
<v-btn variant="tonal" @click="closeSnackbar"> Close </v-btn>
</template>
</v-snackbar>
</div>
</template>
<script lang="ts" setup>
import { toRef } from 'vue'
import { CustomAction, useSnackbarStore } from '@/stores/SnackbarStore'
// Define the properties that the component will receive
const props = defineProps({
snackbarShow: Boolean,
errorMsg: String,
snackbarStatus: String,
snackbarActions: {
type: Array as () => CustomAction[], // Custom type to define snackbar actions
},
})
const snackbar = useSnackbarStore()
const snackbarRef = toRef(props, 'snackbarShow')
const closeSnackbar = () => {
snackbar.closeSnackbar()
}
</script>
<style>
.customSnackbar.v-snackbar--vertical .v-snackbar__wrapper {
background-color: rgb(238, 238, 238);
}
.title-custom {
text-transform: uppercase !important;
}
</style>
Update App.vue to Include the Global Snackbar
To ensure the Snackbar is available across your entire app, import it into App.vue
and pass the required props from the SnackbarStore
.
<template>
<v-app>
<v-main>
<router-view />
</v-main>
<GlobalSnackbar
:error-msg="snackbar.message"
:snackbar-actions="snackbar.customActions"
:snackbar-show="snackbar.isOpen"
:snackbar-status="snackbar.status"
/>
</v-app>
</template>
<script lang="ts" setup>
import { useSnackbarStore } from '@/stores/SnackbarStore'
const snackbar = useSnackbarStore()
</script>
This step is essential to making sure the GlobalSnackbar
component is part of the root layout and is always available to display messages.
Now, let’s demonstrate how to trigger the Snackbar from any component. For this example, we’ll use a button that, when clicked, shows a success message with an optional action. We can edit HelloWorld.vue with the following code.
<template>
<v-container class="fill-height">
<v-responsive class="align-center fill-height mx-auto" max-width="900">
<v-btn @click="openSnackbar()">test</v-btn>
</v-responsive>
</v-container>
</template>
<script setup lang="ts">
import { useSnackbarStore } from '@/stores/SnackbarStore'
const snackbar = useSnackbarStore()
const openSnackbar = () => {
snackbar.showSnackbar('Show snackbar', 'success', [
{ actionName: 'Go to Home', link: '/home' },
{ actionName: 'Go to Test', link: '/test' },
])
}
</script>
Once everything is in place, try clicking the button to trigger the Snackbar. You should see the Snackbar with a success message and optional action buttons appear at the bottom of the screen. And logically, if you click on one of the custom action buttons, it will redirect you properly, but you will obviously see a blank page, as the view doesn’t exist.
Conclusion
In this post, we created a global and custom Snackbar component in Vue 3 using Vuetify and Pinia. We set it up so that it can be easily triggered from anywhere in the app. By managing its state with Pinia, we keep the logic centralized and clean, making the Snackbar reusable across the entire application.
Feel free to adjust the styling and functionality of the Snackbar to suit your needs. Now you have a flexible, easy-to-use notification system for your Vue 3 project!
The article Creating a Global and Custom Snackbar in Vue 3 Using Vuetify and Pinia appeared first on dbi Blog.
Documentum – xPlore Dsearch not starting with a stuck rebuild
I recently faced another interesting issue (c.f. this one) with Documentum xPlore where the Dsearch would refuse to start because of a stuck online rebuild. It happened in a Kubernetes environment using custom images we built (for security and scalability). The K8s Nodes were split across two DataCenters for DR reasons. However, one of the DataCenters had a major issue which caused all VMs running there to freeze and become unresponsive. Some Documentum pods were “running” on these VMs at that point in time. When these K8s Nodes were put offline, the pods were properly rescheduled on the other DataCenter. However, after that operation, one specific Dsearch pod couldn’t start anymore. This Documentum environment has two xPlore Federations, each containing one Dsearch (with embedded local CPS), one IndexAgent and two CPS-Only pods (so four xPlore pods per Federation). All pods were up & running properly, except for that particular Dsearch.
The startup logs of the Dsearch were as follows:
[xplore@ds1-0 ~]$ cdlogPrimaryDsearch
[xplore@ds1-0 logs]$ ls -ltr
total 108
-rw-r----- 1 xplore xplore 0 Oct 14 07:20 rest.log
-rw-r----- 1 xplore xplore 139 Oct 14 07:20 dsearchadminweb.log
-rw-r----- 1 xplore xplore 37722 Oct 14 07:21 cps_daemon.log
-rw-r----- 1 xplore xplore 1353 Oct 14 07:21 cps.log
-rw-r----- 1 xplore xplore 3218 Oct 14 07:21 xdb.log
-rw-r----- 1 xplore xplore 7445 Oct 14 07:21 dfc.log
-rw-r----- 1 xplore xplore 39287 Oct 14 07:21 dsearch.log
[xplore@ds1-0 logs]$
[xplore@ds1-0 logs]$ grep ERROR *
dfc.log:2024-10-14 07:21:18,279 ERROR [default task-1] io.undertow.servlet.request - UT015002: Stopping servlet IndexServerServlet due to permanent unavailability
dsearch.log:2024-10-14 07:21:09,858 ERROR [ServerService Thread Pool -- 76] c.e.d.c.f.i.core.collection.ESSCollection - Failed to create index on library REPO_NAME/dsearch/Data/col1.
dsearch.log:2024-10-14 07:21:09,860 ERROR [ServerService Thread Pool -- 76] c.e.d.core.fulltext.webapp.IndexServerServlet - Failed to start
dsearch.log:2024-10-14 07:21:18,278 ERROR [default task-1] c.e.d.core.fulltext.webapp.IndexServerServlet - Failed to start
dsearch.log:2024-10-14 07:21:45,963 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
dsearch.log:2024-10-14 07:21:45,966 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
dsearch.log:2024-10-14 07:21:45,968 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
dsearch.log:2024-10-14 07:21:45,968 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
dsearch.log:2024-10-14 07:21:45,970 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
[xplore@ds1-0 logs]$
[xplore@ds1-0 logs]$ cat dsearch.log
2024-10-14 07:20:59,035 INFO [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.core.ESSNode - Starting xPlore
2024-10-14 07:20:59,044 INFO [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.core.ESSContext - Initializing xPlore instance
2024-10-14 07:20:59,062 INFO [ServerService Thread Pool -- 76] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Initializing xDB federation and database
2024-10-14 07:21:00,265 INFO [ServerService Thread Pool -- 76] c.e.d.c.f.i.engine.xhive.impl.XhiveSchemaHandler - Not loading configuration from file system, revision 1.29 equal to xDB revision of indexserverconfig.xml
2024-10-14 07:21:00,725 INFO [ServerService Thread Pool -- 76] c.e.d.c.f.indexserver.core.ESSActiveNodesRegistry - Register instance PrimaryDsearch
2024-10-14 07:21:00,726 INFO [ServerService Thread Pool -- 76] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Starting xDB for instance primary
2024-10-14 07:21:00,967 INFO [ServerService Thread Pool -- 76] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Initialize xDB driver to local driver with cache pages: 524288 and Socket timeout :1000 ms
2024-10-14 07:21:01,153 INFO [ServerService Thread Pool -- 76] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - XDB is listening at 9330 successfully
2024-10-14 07:21:01,253 INFO [ServerService Thread Pool -- 76] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Initialize xDB driver to remote driver with cache pages: 524288 and Socket timeout :1000 ms
2024-10-14 07:21:01,253 INFO [ServerService Thread Pool -- 76] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - The XML database server is started successfully. {xDB version=xDB 10_7@4558357}
2024-10-14 07:21:01,261 DEBUG [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Begin to connect to CPS at (local) with connection 3
2024-10-14 07:21:08,559 INFO [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Connected to a local Content Processing Service with version [20.2.0000.0015].
2024-10-14 07:21:08,630 DEBUG [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Begin to connect to CPS at (https://cps1-0.cps1.dctm-ns1-name.svc.cluster.local:9302/cps/ContentProcessingService?wsdl) with connection 3
2024-10-14 07:21:09,077 INFO [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Connected to a remote Content Processing Service [https://cps1-0.cps1.dctm-ns1-name.svc.cluster.local:9302/cps/ContentProcessingService?wsdl] with version [20.2.0000.0015].
2024-10-14 07:21:09,244 DEBUG [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Begin to connect to CPS at (https://cps1-1.cps1.dctm-ns1-name.svc.cluster.local:9302/cps/ContentProcessingService?wsdl) with connection 3
2024-10-14 07:21:09,437 INFO [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Connected to a remote Content Processing Service [https://cps1-1.cps1.dctm-ns1-name.svc.cluster.local:9302/cps/ContentProcessingService?wsdl] with version [20.2.0000.0015].
2024-10-14 07:21:09,626 INFO [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.core.ESSContext - Initialize domain SystemData
2024-10-14 07:21:09,706 INFO [ServerService Thread Pool -- 76] c.e.d.core.fulltext.indexserver.core.ESSContext - Initialize domain REPO_NAME
2024-10-14 07:21:09,858 ERROR [ServerService Thread Pool -- 76] c.e.d.c.f.i.core.collection.ESSCollection - Failed to create index on library REPO_NAME/dsearch/Data/col1.
java.lang.NullPointerException: null
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ReindexStatus.reset(ReindexStatus.java:68)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.addLMPIReindexTaskIfNecessary(ESSCollection.java:2766)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.createIndexes(ESSCollection.java:2824)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.initIndex(ESSCollection.java:2665)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.init(ESSCollection.java:2347)
...
2024-10-14 07:21:09,860 ERROR [ServerService Thread Pool -- 76] c.e.d.core.fulltext.webapp.IndexServerServlet - Failed to start
com.emc.documentum.core.fulltext.common.exception.IndexServerRuntimeException: com.emc.documentum.core.fulltext.common.exception.IndexServerException: Failed to create index on library REPO_NAME/dsearch/Data/col1.
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.ESSContext.initialize(ESSContext.java:261)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.ESSNode.startUp(ESSNode.java:67)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.webapp.IndexServerServlet.init(IndexServerServlet.java:48)
at io.undertow.servlet@2.0.21.Final//io.undertow.servlet.core.LifecyleInterceptorInvocation.proceed(LifecyleInterceptorInvocation.java:117)
...
Caused by: com.emc.documentum.core.fulltext.common.exception.IndexServerException: Failed to create index on library REPO_NAME/dsearch/Data/col1.
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.createIndexes(ESSCollection.java:2864)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.initIndex(ESSCollection.java:2665)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.init(ESSCollection.java:2347)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.<init>(ESSCollection.java:145)
...
Caused by: java.lang.NullPointerException: null
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ReindexStatus.reset(ReindexStatus.java:68)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.addLMPIReindexTaskIfNecessary(ESSCollection.java:2766)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.createIndexes(ESSCollection.java:2824)
... 34 common frames omitted
2024-10-14 07:21:09,861 INFO [Index-Rebuilder-col1-0] c.e.d.c.f.i.core.collection.ESSCollection - Rebuilding index [dmftdoc] for collection [col1].
2024-10-14 07:21:09,863 INFO [Index-Rebuilder-col1-0] c.e.d.c.f.i.core.collection.ESSCollection - Rebuild index dmftdoc with non-blocking mode.
2024-10-14 07:21:18,256 INFO [default task-1] c.e.d.core.fulltext.indexserver.core.ESSNode - Starting xPlore
2024-10-14 07:21:18,256 INFO [default task-1] c.e.d.core.fulltext.indexserver.core.ESSContext - Initializing xPlore instance
2024-10-14 07:21:18,276 INFO [default task-1] c.e.d.c.f.indexserver.core.ESSActiveNodesRegistry - Register instance PrimaryDsearch
2024-10-14 07:21:18,276 INFO [default task-1] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Starting xDB for instance primary
2024-10-14 07:21:18,278 ERROR [default task-1] c.e.d.core.fulltext.webapp.IndexServerServlet - Failed to start
com.emc.documentum.core.fulltext.common.exception.IndexServerRuntimeException: com.emc.documentum.core.fulltext.common.exception.EngineException: Failed to start xDB socket listener
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.ESSContext.initialize(ESSContext.java:261)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.ESSNode.startUp(ESSNode.java:67)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.webapp.IndexServerServlet.init(IndexServerServlet.java:48)
at io.undertow.servlet@2.0.21.Final//io.undertow.servlet.core.LifecyleInterceptorInvocation.proceed(LifecyleInterceptorInvocation.java:117)
...
Caused by: com.emc.documentum.core.fulltext.common.exception.EngineException: Failed to start xDB socket listener
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.engine.xhive.impl.XhiveManager.startDatabase(XhiveManager.java:646)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.ESSContext.startDatabase(ESSContext.java:187)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.ESSContext.initialize(ESSContext.java:237)
... 49 common frames omitted
Caused by: java.net.BindException: Address already in use (Bind failed)
at java.base/java.net.PlainSocketImpl.socketBind(Native Method)
...
2024-10-14 07:21:45,963 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
java.lang.NullPointerException: null
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.ESSIndexHelper.getObjectCategoryConfig(ESSIndexHelper.java:97)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.ESSIndexHelper.shouldCheckObjectVersion(ESSIndexHelper.java:115)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.FtIndexObject.<init>(FtIndexObject.java:71)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.callCPSPlugin(ESSXMLNodeHandler.java:379)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.enter(ESSXMLNodeHandler.java:123)
at deployment.dsearch.war//com.xhive.xDB_10_7_r4558357.ch$b.c(xdb:617)
...
2024-10-14 07:21:45,966 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
java.lang.NullPointerException: null
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.prepareDomDocument(ESSXMLNodeHandler.java:304)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.callCPSPlugin(ESSXMLNodeHandler.java:350)
...
2024-10-14 07:21:45,968 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
java.lang.NullPointerException: null
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.prepareDomDocument(ESSXMLNodeHandler.java:304)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.callCPSPlugin(ESSXMLNodeHandler.java:350)
...
2024-10-14 07:21:45,968 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
java.lang.NullPointerException: null
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.prepareDomDocument(ESSXMLNodeHandler.java:304)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.callCPSPlugin(ESSXMLNodeHandler.java:350)
...
2024-10-14 07:21:45,970 ERROR [Index-Rebuilder-col1-0-Worker-0] c.e.d.c.f.i.core.index.xhive.ESSXMLNodeHandler - Failed to handle 090f123480096bde
java.lang.NullPointerException: null
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.ESSIndexHelper.getObjectCategoryConfig(ESSIndexHelper.java:97)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.ESSIndexHelper.shouldCheckObjectVersion(ESSIndexHelper.java:115)
...
2024-10-14 07:21:46,427 WARN [Index-Rebuilder-col1-0] c.e.d.c.f.i.core.collection.FtReindexTask - Reindex for index col1.dmftdoc failed
com.emc.documentum.core.fulltext.common.exception.IndexServerException: java.lang.NullPointerException
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.recreatePathIndexNB(ESSCollection.java:3391)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.reindexNB(ESSCollection.java:1360)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.ESSCollection.reindex(ESSCollection.java:1249)
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.collection.FtReindexTask.run(FtReindexTask.java:204)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException: null
at deployment.dsearch.war//com.emc.documentum.core.fulltext.indexserver.core.index.xhive.ESSXMLNodeHandler.getTokens(ESSXMLNodeHandler.java:163)
at deployment.dsearch.war//com.xhive.xDB_10_7_r4558357.ch$b.hasNext(xdb:473)
at deployment.dsearch.war//com.xhive.core.index.PathValueIndexModifier.a(xdb:328)
...
[xplore@ds1-0 logs]$
In terms of processes, it looked “OK”, as all the expected Dsearch processes were present, including the (local) CPS Indexing & Querying ones. However, the K8s liveness probe (a custom one I created long ago) was showing an issue with “/dsearch” not responding, and the pod was restarting continuously:
[xplore@ds1-0 logs]$ ps uxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
xplore 3015817 0.0 0.0 13920 3220 pts/1 S 07:20 0:00 /bin/sh ./startPrimaryDsearch.sh
xplore 3015819 0.0 0.0 13940 3276 pts/1 S 07:20 0:00 \_ /bin/sh /app/xPlore/wildfly17.0.1/bin/standalone.sh
xplore 3015920 52.9 1.5 10265428 2030744 pts/1 Sl 07:20 1:14 \_ /app/xPlore/java64/JAVA_LINK/bin/java -D[Standalone] -server -Xms8g -Xmx8g -XX:MaxMetaspaceSize=512m -XX:+UseG1GC -XX:+UseStringDeduplicati
xplore 3016359 0.1 0.0 670524 59008 pts/1 Sl 07:21 0:00 \_ /app/xPlore/dsearch/cps/cps_daemon/bin/CPSDaemon /app/xPlore/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml Daemon0 9322
xplore 3016412 0.1 0.0 670460 59868 pts/1 Sl 07:21 0:00 \_ /app/xPlore/dsearch/cps/cps_daemon/bin/CPSDaemon /app/xPlore/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml Daemon1 9323
xplore 3016486 0.1 0.0 670652 61400 pts/1 Sl 07:21 0:00 \_ /app/xPlore/dsearch/cps/cps_daemon/bin/CPSDaemon /app/xPlore/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml Daemon2 9324
xplore 3016539 0.1 0.0 670396 58836 pts/1 Sl 07:21 0:00 \_ /app/xPlore/dsearch/cps/cps_daemon/bin/CPSDaemon /app/xPlore/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml Daemon3 9325
xplore 3016660 0.1 0.0 670588 63384 pts/1 Sl 07:21 0:00 \_ /app/xPlore/dsearch/cps/cps_daemon/bin/CPSDaemon /app/xPlore/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml Daemon4 9326
xplore 3016714 0.1 0.0 670524 61196 pts/1 Sl 07:21 0:00 \_ /app/xPlore/dsearch/cps/cps_daemon/bin/CPSDaemon /app/xPlore/dsearch/cps/cps_daemon/PrimaryDsearch_local_configuration.xml QDaemon0 932
xplore 3012770 0.0 0.0 14064 3668 pts/1 Ss 07:14 0:00 bash -l
xplore 3017548 0.0 0.0 53832 3960 pts/1 R+ 07:22 0:00 \_ ps uxf
xplore 3011206 0.0 0.0 14064 3656 pts/2 Ss+ 07:10 0:00 bash -l
xplore 2988030 0.0 0.0 14064 3720 pts/0 Ss+ 06:21 0:00 bash -l
xplore 2981707 0.0 0.0 13932 3464 ? Ss 06:15 0:00 /bin/bash /scripts/dbi_entrypoint.sh
[xplore@ds1-0 logs]$
[xplore@ds1-0 logs]$ /scripts/dbi_ft_liveness.sh
INFO - FT liveness exit code is '1' -- Unexpected http response from component 'PrimaryDsearch' for url 'https://xxx:9302/dsearch'...
[xplore@ds1-0 logs]$
The same thing happened when trying to stop xPlore to begin the investigation: it showed a problem connecting to the Dsearch:
[xplore@ds1-0 logs]$ $STARTSTOP stop
**
** The PrimaryDsearch is running with PID: 3015920
**
INFO - Stopping the PrimaryDsearch...
Instance {PrimaryDsearch} is about to shut down, wait for shutdown complete message.
Exception in thread "main" java.lang.IllegalArgumentException: Fail to connect remote server (https://ds1-0.ds1.dctm-ns1-name.svc.cluster.local:9302)
at com.emc.documentum.core.fulltext.client.admin.cli.DSearchAdminScript.getAdminService(DSearchAdminScript.java:86)
at com.emc.documentum.core.fulltext.client.admin.cli.DSearchAdminScript.stopNode(DSearchAdminScript.java:187)
...
Caused by: com.emc.documentum.core.fulltext.common.admin.DSearchAdminException: [ERROR] ServiceException thrown out:
com.emc.documentum.core.fulltext.indexserver.admin.controller.ServiceException: Exception happened while connect to instance: PrimaryDsearch
at com.emc.documentum.core.fulltext.client.admin.api.impl.ESSAdminSOAPClient.<init>(ESSAdminSOAPClient.java:110)
at com.emc.documentum.core.fulltext.client.admin.api.impl.ESSAdminServiceImpl.<init>(ESSAdminServiceImpl.java:62)
...
Caused by: com.emc.documentum.core.fulltext.indexserver.admin.controller.ServiceException: Exception happened while connect to instance: PrimaryDsearch
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
...
{
"outcome" => "success",
"result" => undefined
}
[xplore@ds1-0 logs]$
I had no idea what had been happening in this environment in the few days prior to the DataCenter issue, as the day-to-day support was handled by another team. However, based on the logs, it looked like it could be linked to an online rebuild. Therefore, I went and checked the recent changes made to the indexserverconfig.xml file and I found this:
[xplore@ds1-0 logs]$ cd $CONFIG_HOME
[xplore@ds1-0 config]$
[xplore@ds1-0 config]$ ll -tr indexserverconfig.xml*
-rw-r----- 1 xplore xplore 33388 Jul 3 2021 indexserverconfig.xml.bakHttp
-rw-r----- 1 xplore xplore 33389 Jul 3 2021 indexserverconfig.xml.patch.bak.custom-v1
-rw-r----- 1 xplore xplore 34628 Oct 8 07:34 indexserverconfig.xml.patch.bak.custom-v2
-rw-r----- 1 xplore xplore 36696 Oct 10 01:11 indexserverconfig.xml
[xplore@ds1-0 config]$
[xplore@ds1-0 config]$ diff indexserverconfig.xml indexserverconfig.xml.patch.bak.custom-v2
2c2
< <index-server-configuration enable-lemmatization="true" config-check-interval="60000" revision="1.29">
---
> <index-server-configuration enable-lemmatization="true" config-check-interval="60000" revision="1.22">
159,163d158
< <sub-path path="dmftmetadata//a_status" type="string" enumerate-repeating-elements="false" full-text-search="true" value-comparison="true" returning-contents="true" include-descendants="false" description="Used by REPO_NAME to compute facets." boost-value="1.0" compress="true" leading-wildcard="false" sortable="false" include-start-end-token-flags="true"/>
< <sub-path path="dmftmetadata//doc_business_unit" type="string" enumerate-repeating-elements="false" full-text-search="true" value-comparison="true" returning-contents="true" include-descendants="false" description="Used by REPO_NAME to compute facets." boost-value="1.0" compress="true" leading-wildcard="false" sortable="false" include-start-end-token-flags="true"/>
< <sub-path path="dmftmetadata//site_unit" type="string" enumerate-repeating-elements="false" full-text-search="true" value-comparison="true" returning-contents="true" include-descendants="false" description="Used by REPO_NAME to compute facets." boost-value="1.0" compress="true" leading-wildcard="false" sortable="false" include-start-end-token-flags="true"/>
< <sub-path path="dmftmetadata//doc_responsible_author" type="string" enumerate-repeating-elements="false" full-text-search="true" value-comparison="true" returning-contents="true" include-descendants="false" description="Used by REPO_NAME to compute facets." boost-value="1.0" compress="true" leading-wildcard="false" sortable="false" include-start-end-token-flags="true"/>
< <sub-path path="dmftmetadata//material_number" type="string" enumerate-repeating-elements="false" full-text-search="true" value-comparison="true" returning-contents="true" include-descendants="false" description="Used by REPO_NAME to compute facets." boost-value="1.0" compress="true" leading-wildcard="false" sortable="false" include-start-end-token-flags="true"/>
373,377c368
< <collection usage="Data" document-category="dftxml" name="col1">
< <properties>
< <property value="dmftdoc_9640f" name="Build_dmftdoc"/>
< </properties>
< </collection>
---
> <collection usage="Data" document-category="dftxml" name="col1"/>
[xplore@ds1-0 config]$
[xplore@ds1-0 config]$ date
Mon Oct 14 07:23:34 UTC 2024
[xplore@ds1-0 config]$
Based on the above, it looked evident that five new facets were added by someone on 8-Oct, and for these facets to appear in Documentum Search, an online rebuild then needs to be performed. The last section/difference is related to that online rebuild: whenever you start an online rebuild, the collection definition is changed to include a new property named “Build_dmftdoc“, and when it’s completed, that property gets removed. However, this is a fairly small index, so an online rebuild should be done in half a day at most. Above, it appears as if the online rebuild started on 10-Oct was still in progress, which is definitely not normal.
I have no evidence of it, but I can only assume that someone started an online rebuild on 10-Oct for the “col1” collection and that, before it could complete, the DataCenter issue started, which left the Dsearch pod (or one of the CPS-Only pods) frozen and in an inconsistent/corrupted state. Because of that, the document 090f123480096bde would have a different status in the index / xDB, and that would prevent the Dsearch from starting properly, as seen in the logs.
To fix this issue, I tried to cancel/remove the online rebuild by modifying the indexserverconfig.xml file manually. For that purpose, I simply incremented the revision number and restored the collection “col1” definition to what it should be without a running online rebuild (so removing the 3 properties lines):
[xplore@ds1-0 config]$ cp -p indexserverconfig.xml indexserverconfig.xml.with_issue
[xplore@ds1-0 config]$
[xplore@ds1-0 config]$ vi indexserverconfig.xml ### Corrected the file here
[xplore@ds1-0 config]$
[xplore@ds1-0 config]$ cp -p indexserverconfig.xml indexserverconfig.xml.with_modification
[xplore@ds1-0 config]$
[xplore@ds1-0 config]$ diff indexserverconfig.xml.with_issue indexserverconfig.xml.with_modification
2c2
< <index-server-configuration enable-lemmatization="true" config-check-interval="60000" revision="1.29">
---
> <index-server-configuration enable-lemmatization="true" config-check-interval="60000" revision="1.30">
373,377c373
< <collection usage="Data" document-category="dftxml" name="col1">
< <properties>
< <property value="dmftdoc_9640f" name="Build_dmftdoc"/>
< </properties>
< </collection>
---
> <collection usage="Data" document-category="dftxml" name="col1"/>
[xplore@ds1-0 config]$
With this new revision “1.30“, I started the Dsearch again and this time it was able to start successfully:
[xplore@ds1-0 config]$
[xplore@ds1-0 config]$ $STARTSTOP start
**
** The PrimaryDsearch is shutdown
**
INFO - Starting the PrimaryDsearch...
**
** The PrimaryDsearch is running with PID: 53180
**
[xplore@ds1-0 config]$
[xplore@ds1-0 config]$ cdlogPrimaryDsearch
[xplore@ds1-0 logs]$ cat dsearch.log
2024-10-14 07:30:24,118 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSNode - Starting xPlore
2024-10-14 07:30:24,126 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSContext - Initializing xPlore instance
2024-10-14 07:30:24,143 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Initializing xDB federation and database
2024-10-14 07:30:25,376 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.engine.xhive.impl.XhiveSchemaHandler - File system configuration (revision 1.30) is higher than xDB revision (1.29). Updating indexserverconfig.xml in xDB.
2024-10-14 07:30:25,931 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.indexserver.core.ESSActiveNodesRegistry - Register instance PrimaryDsearch
2024-10-14 07:30:25,932 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Starting xDB for instance primary
2024-10-14 07:30:26,123 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Initialize xDB driver to local driver with cache pages: 524288 and Socket timeout :1000 ms
2024-10-14 07:30:26,395 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - XDB is listening at 9330 successfully
2024-10-14 07:30:26,504 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - Initialize xDB driver to remote driver with cache pages: 524288 and Socket timeout :1000 ms
2024-10-14 07:30:26,504 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.engine.xhive.impl.XhiveManager - The XML database server is started successfully. {xDB version=xDB 10_7@4558357}
2024-10-14 07:30:26,514 DEBUG [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Begin to connect to CPS at (local) with connection 3
2024-10-14 07:30:33,925 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Connected to a local Content Processing Service with version [20.2.0000.0015].
2024-10-14 07:30:33,986 DEBUG [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Begin to connect to CPS at (https://cps1-0.cps1.dctm-ns1-name.svc.cluster.local:9302/cps/ContentProcessingService?wsdl) with connection 3
2024-10-14 07:30:34,447 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Connected to a remote Content Processing Service [https://cps1-0.cps1.dctm-ns1-name.svc.cluster.local:9302/cps/ContentProcessingService?wsdl] with version [20.2.0000.0015].
2024-10-14 07:30:34,636 DEBUG [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Begin to connect to CPS at (https://cps1-1.cps1.dctm-ns1-name.svc.cluster.local:9302/cps/ContentProcessingService?wsdl) with connection 3
2024-10-14 07:30:34,838 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.cps.CPSSubmitter - Connected to a remote Content Processing Service [https://cps1-1.cps1.dctm-ns1-name.svc.cluster.local:9302/cps/ContentProcessingService?wsdl] with version [20.2.0000.0015].
2024-10-14 07:30:35,030 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSContext - Initialize domain SystemData
2024-10-14 07:30:35,128 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSContext - Initialize domain REPO_NAME
2024-10-14 07:30:35,387 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSContext - Starting Audit Service
2024-10-14 07:30:35,391 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.services.audit.impl.FtAuditService - Audit records purge task will launch at: 2024-10-15T00:00:00+00:00
2024-10-14 07:30:35,392 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.indexserver.services.FtBaseService - Auditing service started
2024-10-14 07:30:35,392 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSContext - Starting System Metrics Service
2024-10-14 07:30:35,396 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.indexserver.services.FtBaseService - SystemMetrics service started
2024-10-14 07:30:35,396 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSContext - Starting Group Cache Service
2024-10-14 07:30:35,397 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.s.groupcache.impl.FtGroupCacheService - Enable incremental group cache update
2024-10-14 07:30:35,397 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.indexserver.services.FtBaseService - GroupCache service started
2024-10-14 07:30:35,399 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.services.security.GlobalACLCache - Use global ACL cache, user count=10, clear delay time=0
2024-10-14 07:30:35,400 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.services.security.GlobalACECache - Use global ACE cache, ACE cache size=1000000
2024-10-14 07:30:35,400 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSContext - Starting Auto Warmup Service
2024-10-14 07:30:35,410 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.indexserver.services.FtBaseService - Warmup service started
2024-10-14 07:30:35,417 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSContext - The xPlore instance PrimaryDsearch initialized
2024-10-14 07:30:35,425 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.core.index.thread.WorkerThreadPool - Spawn a new thread CPSWorkerThread-1
2024-10-14 07:30:35,426 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.core.index.thread.WorkerThreadPool - Spawn a new thread CPSWorkerThread-2
2024-10-14 07:30:35,427 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.core.index.thread.WorkerThreadPool - Spawn a new thread IndexWorkerThread-1
2024-10-14 07:30:35,427 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.core.index.thread.WorkerThreadPool - Spawn a new thread IndexWorkerThread-2
2024-10-14 07:30:35,438 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSNode - The xPlore instance PrimaryDsearch started the indexing threads.
2024-10-14 07:30:35,451 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.search.threads.SearchThreadPool - Spawn 4 search threads
2024-10-14 07:30:36,445 INFO [ServerService Thread Pool -- 80] c.e.d.c.f.i.admin.controller.jmx.ESSAdminJMXPlugin - Admin agent started.
2024-10-14 07:30:36,445 INFO [ServerService Thread Pool -- 80] c.e.d.core.fulltext.indexserver.core.ESSNode - The xPlore instance PrimaryDsearch started. {version=20.2.0000.0015}
[xplore@ds1-0 logs]$
To close this topic, a new online rebuild was started to make sure the facets show up properly, since I wasn’t sure about the exact status of the other collections. Once it was done, the Dsearch pod could restart again without problems, so the issue was fully fixed.
The article Documentum – xPlore Dsearch not starting with a stuck rebuild first appeared on dbi Blog.
PostgreSQL 18: Tweaking relation statistics
Up to now, there has been no official way in PostgreSQL to tweak optimizer statistics manually. While this is quite common in other relational databases, PostgreSQL did not provide anything for it. The first steps to allow this have now been committed for PostgreSQL 18.
Before looking at the two new functions which have been added, let’s create a small test table and an index:
postgres=# create table t ( a int, b text );
CREATE TABLE
postgres=# insert into t values (1,'aa');
INSERT 0 1
postgres=# insert into t select i, 'bb' from generate_series(2,100) i;
INSERT 0 99
postgres=# analyze t;
ANALYZE
postgres=# create index i on t(b);
CREATE INDEX
postgres=# \d t
Table "public.t"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
a | integer | | |
b | text | | |
Indexes:
"i" btree (b)
The number of pages and the number of rows are currently reported as this:
postgres=# select relpages,reltuples from pg_class where relname = 't';
relpages | reltuples
----------+-----------
1 | 100
(1 row)
Asking for the explain plan of a simple select over the table, restricting column b to 'aa', gives a sequential scan:
postgres=# explain select * from t where b = 'aa';
QUERY PLAN
-------------------------------------------------
Seq Scan on t (cost=0.00..2.25 rows=1 width=7)
Filter: (b = 'aa'::text)
(2 rows)
As the table consists of only one block, it is still faster to read the whole table than to use the index, even if only one row matches our criteria. One of the two new functions introduced by the commit linked above is pg_set_relation_stats:
postgres=# \x
Expanded display is on.
postgres=# \df pg_set_relation_stats
List of functions
-[ RECORD 1 ]-------+------------------------------------------------------------------------------------------------------------------------------------------
Schema | pg_catalog
Name | pg_set_relation_stats
Result data type | boolean
Argument data types | relation regclass, relpages integer DEFAULT NULL::integer, reltuples real DEFAULT NULL::real, relallvisible integer DEFAULT NULL::integer
Type | func
As you can easily see, this function takes a relation name, the number of pages, the number of rows and the number of all-visible pages (relallvisible) as input parameters. This allows us to tweak the statistics so that the optimizer will go for an index scan instead of the sequential scan:
postgres=# select * from pg_set_relation_stats('t'::regclass, 1, 1000000 );
pg_set_relation_stats
-----------------------
t
(1 row)
postgres=# \x
Expanded display is off.
postgres=# select relpages,reltuples from pg_class where relname = 't';
relpages | reltuples
----------+-----------
1 | 1e+06
(1 row)
postgres=# explain select * from t where b = 'aa';
QUERY PLAN
-----------------------------------------------------------------
Index Scan using i on t (cost=0.17..183.18 rows=10000 width=7)
Index Cond: (b = 'aa'::text)
(2 rows)
postgres=#
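The call above used positional arguments. Given the signature shown by \df earlier, the same call can also be written with named notation, which makes it explicit which statistic is being set (a sketch, not part of the original example):
select pg_set_relation_stats('t'::regclass, relpages => 1, reltuples => 1000000);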
Nice, now we can debug optimizer decisions by modifying these statistics. Just be aware that the statistics we set using this function are only temporary: any manual or automatic analyze will overwrite them:
postgres=# analyze t;
ANALYZE
postgres=# select relpages,reltuples from pg_class where relname = 't';
relpages | reltuples
----------+-----------
1 | 100
(1 row)
postgres=# explain select * from t where b = 'aa';
QUERY PLAN
-------------------------------------------------
Seq Scan on t (cost=0.00..2.25 rows=1 width=7)
Filter: (b = 'aa'::text)
(2 rows)
The second function can be used to reset the statistics to the values they would have right after the table is created:
postgres=# select * from pg_clear_relation_stats('t'::regclass);
pg_clear_relation_stats
-------------------------
t
(1 row)
postgres=# select relpages,reltuples from pg_class where relname = 't';
relpages | reltuples
----------+-----------
0 | -1
(1 row)
The article PostgreSQL 18: Tweaking relation statistics first appeared on dbi Blog.
Migrating AWX from one Kubernetes cluster to another: A custom approach
In this guide, we’ll walk through migrating an AWX instance from one Kubernetes infrastructure to another, with two important considerations. First, both AWX instances are on completely different networks, meaning there’s no direct connectivity between them. Second, we aim to replicate the credentials (including passwords) stored in AWX, which requires careful handling. This approach differs from the official documentation due to these two specific constraints.
Step 1: Backup AWX on the old infrastructure
To back up AWX on the old infrastructure, we’ll use the AWXBackup resource provided by the AWX Operator. This will capture all necessary configurations, including credentials, job templates, and database data.
- Create the AWXBackup resource
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
name: awx-backup
namespace: <namespace-awx>
spec:
deployment_name: <awx-instance-name>
- Apply the backup configuration
kubectl apply -f awxbackup.yaml
- Verify the backup
Check the status of the AWXBackup resource to ensure the backup is complete
kubectl get awxbackup -n <namespace-awx>
- Access the backup data
AWXBackup creates a PVC to store the backup data. We need to retrieve it.
kubectl get pvc -n <namespace-awx>
- Mount the backup PVC
Create a temporary pod to access the backup files
apiVersion: v1
kind: Pod
metadata:
name: awx-backup-access
namespace: <namespace-awx>
spec:
containers:
- name: backup-container
image: busybox:latest
command: ["/bin/sh", "-c", "sleep 3600"]
volumeMounts:
- mountPath: /backup-data
name: awx-backup-pvc
volumes:
- name: awx-backup-pvc
persistentVolumeClaim:
claimName: <awx-backup-pvc>
- Compress the backup
Once inside the pod, go to the backup directory and archive the latest directory
kubectl exec -it awx-backup-access -n <namespace-awx> -- /bin/sh
cd /backup-data
ls -l
## Find the latest directory
tar -czvf /awx_backup.tar.gz <latest-directory>
- Copy the archive locally
Use kubectl cp to copy the archive from the pod to your local machine
kubectl cp <namespace-awx>/awx-backup-access:/backup-data/awx_backup.tar.gz ./awx_backup.tar.gz
- Clean up the temporary pod
Once the backup is copied, delete the temporary pod
kubectl delete pod awx-backup-access -n <namespace-awx>
- Recover the decryption key for secret keys
kubectl get secrets -n <namespace-awx> <awx-instance-name>-secret-key -o jsonpath='{.data.secret_key}' && echo;
Save the base64-encoded key; we will need it during the restore step.
Step 2: Setup the new AWX instance
On the new infrastructure, we first need to install AWX via the AWX Operator.
- Install the AWX Operator
For this step follow the official documentation of AWX Operator
If you need to deploy AWX using images from a local registry, you can read my other article: Deploy awx-operator with Helm using images from a local registry
- Verify the AWX deployment
Check that the new AWX instance is up and running
kubectl get awx -n <new-namespace-awx>
Next, we need to create an AWXBackup on the new infrastructure.
- Create an AWXBackup for the backup data
Create the awxbackup.yaml file:
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
name: awx-backup-migration
namespace: <namespace-awx>
spec:
deployment_name: <awx-instance-name>
- Apply the backup configuration
kubectl apply -f awxbackup.yaml
- Verify the backup
Check the status of the AWXBackup resource to ensure the backup is complete
kubectl get awxbackup -n <namespace-awx>
- Identify the backup PVC
kubectl get pvc -n <namespace-awx>
Now that the new AWX is set up, we’ll transfer the backup data and restore it.
- Transfer the backup archive
Copy the awx_backup.tar.gz file to the new infrastructure by uploading it to the new backup PVC using a temporary pod
- Create a temporary pod to restore data
Create the awx-restore-access.yaml file
apiVersion: v1
kind: Pod
metadata:
name: awx-backup-restore
namespace: <new-namespace-awx>
spec:
containers:
- name: restore-container
image: busybox:latest
command: ["/bin/sh", "-c", "sleep 3600"]
volumeMounts:
- mountPath: /backup-data
name: awx-backup-pvc
volumes:
- name: awx-backup-pvc
persistentVolumeClaim:
claimName: <pvc-for-awx-backup>
kubectl apply -f awx-restore-access.yaml
- Use kubectl cp to upload the archive to the pod
kubectl cp ./awx_backup.tar.gz <new-namespace-awx>/awx-backup-restore:/awx_backup.tar.gz
- Replace the data from archive
Inside the pod, extract the archive
kubectl exec -it awx-backup-restore -n <new-namespace-awx> -- /bin/sh
cd /backup-data
ls -l
## Find the latest directory
rm -rf /backup-data/<latest-backup-directory>/{tower.db,awx-objects}
cd
tar -xzvf /awx_backup.tar.gz
cp <backup-directory-from-tar.gz>/tower.db /backup-data/<latest-backup-directory>/.
cp -r <backup-directory-from-tar.gz>/awx-objects /backup-data/<latest-backup-directory>/.
vi /backup-data/<latest-backup-directory>/secret.yml
## Replace the value of the variable
## secrets:
## secretKeySecret:
## data: {secret_key: ###insert here the base64 decryption key recovered at the end of Step 1### }
- Create the AWXRestore resource
Create an AWXRestore resource to apply the backup
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
name: awx-backup-restore
namespace: <new-namespace-awx>
spec:
deployment_name: <awx-instance-name>
backup_name: awx-backup-migration
no_log: false
force_drop_db: true
- Apply the AWXRestore
kubectl apply -f awxrestore.yaml
- Monitor the restoration
Ensure the restoration completes successfully
kubectl get awxrestore -n <new-namespace-awx>
- You can log in as admin (the password will be that of the old infrastructure)
- Explore the various AWX resources and check that everything has been migrated correctly
- Run a template job to validate correct operation
By following this procedure, we’ve successfully migrated an AWX instance across two isolated Kubernetes clusters while maintaining full fidelity of the AWX credentials and configurations.
The article Migrating AWX from one Kubernetes cluster to another: A custom approach first appeared on dbi Blog.
Mastering Custom Integration Runtimes in Azure Data Factory: An Introduction to Our Upcoming Series
Dear Readers,
This post serves as an introduction to a six-part series focused on the Integration Runtime (IR) within Azure Data Factory (ADF). Over the course of these six articles, you’ll gain essential knowledge and practical skills to choose and implement the right Integration Runtime for your specific requirements. Each article will explore different aspects of IR, providing you with a comprehensive guide to handling various integration scenarios effectively.
What is Azure Data Factory (a.k.a. ADF)
Azure Data Factory is a cloud-based data integration service that enables the creation and scheduling of data-driven workflows. Its primary strength lies in its capability to interact with multiple types of data sources and orchestrate data flows between them.
What is Integration Runtime?
The Integration Runtime (IR) is the compute infrastructure that ADF uses to provide data integration capabilities across various network environments. There are three types of integration runtimes offered by Data Factory:
- Azure Integration Runtime: Managed by Microsoft and can only access data sources in public networks.
- Self-hosted Integration Runtime: Deployed and managed by you, allowing access to both public and private networks. You’re responsible for patching, scaling, and maintaining the infrastructure.
- Azure-SSIS Integration Runtime: A Microsoft-managed IR designed to run SQL Server Integration Services (SSIS) packages in ADF. This IR can access both public and private networks.
In this series, we will focus on the custom IR, namely Self-hosted IR and Azure-SSIS IR, which offer greater flexibility, such as the ability to interact with private networks and, in the case of SSIS IR, the execution of SSIS packages.
Context and Case Studies
To guide you through the process of understanding the provided solutions, we’ll explore the following fictional scenarios:
Case Study 1: LucyCheck Ltd.
A Medical analysis laboratory where files are stored locally on a PC connected to the Internet. The medical regulatory authority, which operates on Azure and uses Azure Data Factory for its data integration processes, needs to retrieve specific data from the laboratory’s local system. This data will then be integrated into the regulatory authority’s Data Warehouse hosted on Azure.
How should this integration be implemented?
Case Study 2: Helory Swags Ltd
A luxury children’s clothing retailer is in the process of migrating its infrastructure to Azure. According to company policy, all new development must now take place within Azure.
The management requires reporting that involves data from a Line of Business (LOB) system, which is still stored in an on-premises SQL Server and won’t be migrated until next year.
How can an integration solution be implemented to integrate the on-premises SQL Server in the Azure Data Warehouse?
Case Study 3: Moera LinkedTeen Ltd.
A communications company with a hybrid infrastructure and several ADF instances. To fulfill various requests, these ADF instances need to extract data from the same on-premises SQL Server instance.
What is the best way to implement this integration?
Case Study 4: Aelynn Money Transfer Ltd.
A financial services company is migrating its entire infrastructure to Azure, except for one Line of Business (LOB) system, which will remain on-premises. The company’s data warehouse is populated with data from this LOB using a complex SSIS package. Due to time constraints, this SSIS package cannot be easily redeveloped in Azure Data Factory (ADF).
How can we ensure the continued use of this complex SSIS package in Azure while maintaining its connection to the on-premises LOB data?
Upcoming Solutions in the Series
The upcoming articles in the series will cover the following topics:
- Part 1: Security prerequisites for ADF access to on-premises environment.
- Part 2: ADF access to on-premises file systems.
- Part 3: ADF access to on-premises SQL Servers.
- Part 4: Implementing a shared ADF Integration Runtime.
- Part 5: Point-to-site (P2S) VPN setup.
- Part 6: Using SSIS IR in ADF to connect with on-premises SQL Servers.
By the end of this series, you will have a comprehensive understanding of both Self-hosted and Azure-SSIS Integration Runtimes in Azure Data Factory. Armed with this knowledge, you will confidently choose and implement the most suitable Integration Runtime for your organization’s needs.
Stay tuned for the next instalment of this exciting journey!
The article Mastering Custom Integration Runtimes in Azure Data Factory: An Introduction to Our Upcoming Series first appeared on dbi Blog.
PostgreSQL 18: REJECT_LIMIT for COPY
With the just released version 17 of PostgreSQL, copy got the “ON_ERROR” switch. This is very useful if you load something which contains rows that cannot be inserted into the target table. Before that, copy would just stop and error out. The only option was to fix the data and then copy again. Using “ON_ERROR”, copy can be told to continue loading and ignore those rows with bad data.
The default behavior is still to stop as soon as there is a row/line which cannot be loaded:
postgres=# create table t ( a int, b int, c text );
CREATE TABLE
postgres=# \! cat load.txt
1,1,'aaa'
2,2,'bbb'
3,3,'ccc'
4,four,'ddd'
5,5,'eee'
6,6,'fff'
7,7,'ggg'
8,eight,'hhh'
9,9,'iii'
postgres=# copy t from '/home/postgres/load.txt' with ( delimiter ',');
ERROR: invalid input syntax for type integer: "four"
CONTEXT: COPY t, line 4, column b: "four"
Starting with PostgreSQL 17, we can tell copy to ignore those rows and continue to load the data:
postgres=# copy t from '/home/postgres/load.txt' with ( delimiter ',', on_error ignore);
COPY 7
postgres=# select * from t;
a | b | c
---+---+-------
1 | 1 | 'aaa'
2 | 2 | 'bbb'
3 | 3 | 'ccc'
5 | 5 | 'eee'
6 | 6 | 'fff'
7 | 7 | 'ggg'
9 | 9 | 'iii'
(7 rows)
What we cannot do up to PostgreSQL 17 is set a limit for bad data. This will be possible starting with PostgreSQL 18, as copy got a new switch: “REJECT_LIMIT”.
Using this, we can tell copy to fail as soon as more than n rows cannot be loaded:
postgres=# truncate t;
TRUNCATE TABLE
postgres=# copy t from '/home/postgres/load.txt' with ( delimiter ',', on_error ignore, reject_limit 1);
ERROR: skipped more than REJECT_LIMIT (1) rows due to data type incompatibility
CONTEXT: COPY t, line 8, column b: "eight"
postgres=# copy t from '/home/postgres/load.txt' with ( delimiter ',', on_error ignore, reject_limit 2);
ERROR: skipped more than REJECT_LIMIT (2) rows due to data type incompatibility
CONTEXT: COPY t, line 10, column a: ""
postgres=# copy t from '/home/postgres/load.txt' with ( delimiter ',', on_error ignore, reject_limit 3);
COPY 7
postgres=# select * from t;
a | b | c
---+---+-------
1 | 1 | 'aaa'
2 | 2 | 'bbb'
3 | 3 | 'ccc'
5 | 5 | 'eee'
6 | 6 | 'fff'
7 | 7 | 'ggg'
9 | 9 | 'iii'
(7 rows)
Nice, this gives more control over how to handle data which does not fit into the target table.
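As a side note, assuming the LOG_VERBOSITY option that copy gained in PostgreSQL 17 is available in your build, it can be combined with these switches to also get a NOTICE for every skipped row (a sketch, not part of the example above):
copy t from '/home/postgres/load.txt' with ( delimiter ',', on_error ignore, reject_limit 3, log_verbosity verbose );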
The article PostgreSQL 18: REJECT_LIMIT for COPY first appeared on dbi Blog.
Understanding Head Blockers in SQL-Server
A customer of mine had a recurring issue with one of his business-critical applications where everything was frozen and stuck for the users. The issue was caused by a head-blocking transaction which increased blocking in the database exponentially. Depending on your technical background, you may know that locking is a very important mechanism in relational database systems to ensure transactional isolation, thus fulfilling the “I” in the theoretical ACID concept. However, when locks cause extensive blocking, it can lead to performance issues.
But what exactly is a head blocker?
A head blocker is a transaction that is either very slow or never commits, holding locks on database resources. Other transactions that must access the resources locked by the head blocker are also holding locks while waiting for those resources.
On the next level, there are more transactions that need access to resources locked by the transactions blocked by the head blocker, and they too are holding locks. This process continues on multiple levels, creating a “lock chain” where all the transactions are waiting for the first transaction (the head blocker) to commit or roll back.
For the non-technical blog reader, let me give you an example from a daily life situation:
You are in the supermarket and you see a long queue at the checkout. At the front of the queue you can see person A, who is talking with the cashier about the weather, the next neighborhood event and the neighbor's cat. This person is the head blocker. Behind that person, person B and person C are waiting. This is the beginning of the lock chain, and the checkout is the database resource on which the head blocker is holding a lock.
To better understand the complexity of the lock mechanism in relational database systems with this metaphor imagine that only one person is allowed to have a specific item in their shopping cart at any given time in the supermarket.
Person C is waiting for person A at the checkout and has a liter of milk in their shopping cart. Person D, who already has bread and butter in their shopping cart, but wants some milk too, must wait until person C can pay and leave the supermarket. Person E and F, who also already have some items in their shopping cart but still have butter or bread on their shopping list, must wait too. The whole process continues until everyone in the supermarket is waiting for person A to complete the story about the neighbor’s cat.
In the case of my customer, the IT staff there terminated the head-blocking transaction manually as soon as they recognized it.
The major problem with this approach was that the IT staff only noticed the problem through user feedback, so way too late.
Therefore I implemented a custom-developed alert solution for the case when blocking increases within the application's database. This enabled the IT staff to terminate the head-blocking transaction before the application users were disturbed.
How to determine blocking in a database?
SQL-Server offers some very useful dynamic management views (DMV) to query information about blocking, for example the DMV “sys.dm_exec_requests”. When you query this DMV and you are interested in blocking, the most interesting column is the “blocking_session_id” column. This column shows you which session is currently blocking the particular request. If the request isn’t blocked, it will contain a “0”.
There are also other pragmatic ways to gather information about currently running processes and their blocking. One very user-friendly method that provides a wealth of useful information is the stored procedure sp_whoisactive, developed by Adam Machanic. This stored procedure displays currently running processes in SQL Server in a user-friendly manner and can be controlled through multiple input parameters. It also returns a blocking_session_id column and many other useful details, such as the SQL text, the number of reads and writes, CPU time, and more. The stored procedure is not available by default in SQL Server but can be downloaded and installed from GitHub.
Another valuable solution is the stored procedure sp_HumanEventsBlockViewer, created by Erik Darling. This procedure is used in conjunction with an extended event session, which collects data from the Blocked Process Report. The stored procedure reads from the extended event session and presents the information in a very useful format, including valuable details such as the SQL text, the isolation level of the transactions, the wait time, and more.
I used the DMV “sys.dm_exec_requests” in the stored procedure that I developed to monitor blocking in the customer’s database.
Please note that I’m just sharing some key snippets of the stored procedure and not the entire procedure, as sharing the whole one would go beyond the planned scope of this blog post.
The stored procedure first queries the DMV with the “count” aggregation function to get the total number of blocked requests:
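The original snippet is not reproduced here; a minimal sketch of what such a query could look like (variable names are mine):
DECLARE @BlockedRequestCount int;

-- total number of requests currently waiting on another session
SELECT @BlockedRequestCount = COUNT(*)
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;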
After that, the stored procedure checks if the number of blocked requests exceeds the defined sensitivity threshold:
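Continuing the sketch, with a hypothetical @SensitivityThreshold value:
DECLARE @SensitivityThreshold int = 20;  -- hypothetical threshold, adjust to your environment

IF @BlockedRequestCount > @SensitivityThreshold
BEGIN
    PRINT 'Blocking threshold exceeded, collecting details...';
END;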
If this is true, the stored procedure starts collecting the information we are interested in. The first piece of information is the session that is holding the most locks in the database. Therefore I wrote the query below, which stores that session ID in a variable:
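A possible way to do this, here approximated as the session that blocks the most requests (again a sketch, not the original code):
DECLARE @HeadBlockerSessionId int;

-- session that is blocking the highest number of requests
SELECT TOP (1) @HeadBlockerSessionId = blocking_session_id
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0
GROUP BY blocking_session_id
ORDER BY COUNT(*) DESC;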
After that, the stored procedure additionally gathers the number of requests blocked by this session and stores it in a second variable:
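For example:
DECLARE @RequestsBlockedByHeadBlocker int;

SELECT @RequestsBlockedByHeadBlocker = COUNT(*)
FROM sys.dm_exec_requests
WHERE blocking_session_id = @HeadBlockerSessionId;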
After that, I used the session ID of the session that is blocking the most requests as a starting point to walk the lock chain up through the DMV:
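One way to walk such a lock chain is a recursive CTE over the same DMV; the original procedure may do it differently, so treat this as a sketch:
;WITH LockChain AS
(
    -- level 1: requests blocked directly by the head blocker
    SELECT session_id, blocking_session_id, 1 AS chain_level
    FROM sys.dm_exec_requests
    WHERE blocking_session_id = @HeadBlockerSessionId

    UNION ALL

    -- deeper levels: requests blocked by sessions that are themselves blocked
    SELECT r.session_id, r.blocking_session_id, lc.chain_level + 1
    FROM sys.dm_exec_requests AS r
    JOIN LockChain AS lc ON r.blocking_session_id = lc.session_id
)
SELECT session_id, blocking_session_id, chain_level
FROM LockChain
ORDER BY chain_level, session_id;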
Finally, I formatted everything in HTML and sent the information via Database Mail. This resulted in the mail below, which warned the recipients that the blocking threshold had been exceeded and provided some valuable information:
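The sending part relies on Database Mail; a minimal, hypothetical example of such a call (profile name and recipients are placeholders):
DECLARE @HtmlBody nvarchar(max) = N'<html><body><p>Blocking threshold exceeded.</p></body></html>';

EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N'DBA_Mail_Profile',      -- placeholder profile
    @recipients   = N'dba-team@example.com',  -- placeholder recipients
    @subject      = N'SQL Server blocking alert',
    @body         = @HtmlBody,
    @body_format  = 'HTML';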
Summary:
This blog post presents an approach to monitoring blocking within a SQL-Server database. Locking is an important mechanism in SQL-Server and other relational database systems to ensure transactional isolation, and it is normal for locking to lead to blocking to an acceptable extent. However, extensive blocking can slow down performance, which is considered a problem. There are multiple potential root causes of blocking problems, including:
- Poor query performance, leading to long running transactions
- Lock escalation
- High concurrency on particular database resources
- Bad transaction design
- High transaction isolation level
If you are facing similar problems or if you have any questions, don’t hesitate to share them in the comment section!
The article Understanding Head Blockers in SQL-Server first appeared on dbi Blog.
Enhancing PostgreSQL Performance with Index Advisor and HypoPG
In database management, query performance is critical for efficient data retrieval. PostgreSQL, a powerful open-source database, offers various extensions to optimize and tune query performance. Two such extensions are Index Advisor and HypoPG. These tools help database administrators (DBAs) identify and test potential indexes to improve query performance without the overhead of creating and dropping indexes repeatedly. This blog will guide you through the installation and usage of these extensions to enhance your PostgreSQL performance.
1. Installing Index Advisor and HypoPG Extensions
Before diving into the usage of these extensions, let’s start with the installation process.
1.1. PostgreSQL Index Advisor
Index Advisor is a PostgreSQL extension that recommends indexes based on query analysis. One nice point is that the extension code is still maintained. index_advisor requires Postgres with hypopg installed. Here’s how to install it:
- Installation:
--- Install hypopg
postgres@server> sudo apt install postgresql-XY-hypopg
--- Install index_adviser
postgres@server> git clone https://github.com/supabase/index_advisor.git
postgres@server> cd index_advisor
postgres@server> sudo make install
- Enable the Extension in PostgreSQL:
postgres# CREATE EXTENSION hypopg;
postgres# create extension if not exists index_advisor cascade;
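As a quick sanity check (not part of the original installation steps), you can verify that both extensions are registered:
postgres# select extname, extversion from pg_extension where extname in ('hypopg', 'index_advisor');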
Now that the extensions are installed, let’s explore how to use them to analyze and optimize query performance.
2. Using Index Advisor to Recommend Indexes
Index Advisor analyzes your queries and suggests indexes that can improve performance. Let’s see it in action with a sample query.
2.1. Example Query
Consider the following query that retrieves data from the dbi_technologies, dbi_techres, and dbi_techlevels tables:
SELECT tech.id techid, tech.label techlabel, reslevel.techlevel techlevel, reslevel.techlvlid
FROM dbi_technologies tech
LEFT JOIN
(SELECT techlvl.label techlevel, techlvl.id techlvlid, techres.tech_id tech_id
FROM dbi_techres techres,
dbi_techlevels techlvl
WHERE techres.res_id = $2
AND techlvl.id = techres.techlevel_id) AS reslevel
ON tech.id = reslevel.tech_id
WHERE tech.cat_id = $1
ORDER BY techlabel;
To get index recommendations for the above query, use the following command:
SELECT * FROM index_advisor('YOUR QUERY HERE');
-[ RECORD 1 ]-------------------------
startup_cost_before | 70.97
startup_cost_after | 66.76
total_cost_before | 71.00
total_cost_after | 66.79
index_statements | ['CREATE INDEX ON dbicc.dbi_technologies USING btree (cat_id)', 'CREATE INDEX ON dbicc.dbi_techres USING btree (res_id)']
errors | []
- startup_cost_before and startup_cost_after: These show the estimated cost until the query can return the first row of the result before and after applying the recommended indexes. For example, with a hash join, this includes the cost to build the internal table. For a sequential scan, the startup cost is always zero, as rows can be returned immediately.
- total_cost_before and total_cost_after: Estimated total cost of executing the query before and after applying the recommended indexes.
- index_statements: Suggested SQL statements to create the indexes. In this case, we got two suggestions.
- errors: Any errors encountered during the analysis (empty in this case).
HypoPG allows you to test the impact of hypothetical indexes without actually creating them. This is useful for evaluating potential indexes before committing to their creation. The hypothetical indexes created by HypoPG are stored in your connection’s private memory, not in any catalog. This means they won’t increase the size of any table or affect other connections.
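Two companion functions are handy while experimenting: hypopg() lists the hypothetical indexes defined in the current connection, and hypopg_reset() removes them all. A short sketch:
-- list the hypothetical indexes defined in this connection
SELECT * FROM hypopg();
-- remove all hypothetical indexes of this connection once testing is done
SELECT hypopg_reset();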
3.1. Analyzing Query Performance without Hypothetical Indexes
Let’s start by testing the query performance without any hypothetical indexes.
postgres# EXPLAIN SELECT tech.id techid, tech.label techlabel, reslevel.techlevel techlevel, reslevel.techlvlid
FROM dbi_technologies tech
LEFT JOIN (
SELECT techlvl.label techlevel, techlvl.id techlvlid, techres.tech_id tech_id
FROM dbi_techres techres, dbi_techlevels techlvl
WHERE techres.res_id = 60
AND techlvl.id = techres.techlevel_id
) AS reslevel
ON tech.id = reslevel.tech_id
WHERE tech.cat_id = 10
ORDER BY techlabel;
+-------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-------------------------------------------------------------------------------------------------------------------------|
| Sort (cost=72.10..72.13 rows=14 width=241) |
| Sort Key: tech.label |
| -> Nested Loop Left Join (cost=0.41..71.83 rows=14 width=241) |
| -> Seq Scan on dbi_technologies tech (cost=0.00..10.01 rows=14 width=19) |
| Filter: (cat_id = 10) |
| -> Nested Loop (cost=0.41..4.41 rows=1 width=226) |
| -> Index Only Scan using idx_tech_res_level on dbi_techres techres (cost=0.28..4.02 rows=1 width=8) |
| Index Cond: ((tech_id = tech.id) AND (res_id = 60)) |
| -> Index Scan using dbi_techlevels_id_pkey on dbi_techlevels techlvl (cost=0.13..0.33 rows=1 width=222) |
| Index Cond: (id = techres.techlevel_id) |
+-------------------------------------------------------------------------------------------------------------------------+
We are going to create our first hypothetical index based on the output from the index_advisor statement. Use hypopg_create_index to create it:
SELECT * FROM hypopg_create_index('CREATE INDEX ON dbicc.dbi_technologies USING btree (cat_id)');
Run EXPLAIN on your query to see the impact of the hypothetical index:
postgres# EXPLAIN SELECT tech.id techid, tech.label techlabel, reslevel.techlevel techlevel, reslevel.techlvlid
FROM dbi_technologies tech
LEFT JOIN
( SELECT techlvl.label techlevel, techlvl.id techlvlid, techres.tech_id tech_id
FROM dbi_techres techres,
dbi_techlevels techlvl
WHERE techres.res_id = 60
AND techlvl.id = techres.techlevel_id) AS reslevel
ON tech.id = reslevel.tech_id
WHERE tech.cat_id = 10
ORDER BY techlabel;
+-------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|-------------------------------------------------------------------------------------------------------------------------|
| Sort (cost=71.39..71.43 rows=14 width=241) |
| Sort Key: tech.label |
| -> Nested Loop Left Join (cost=4.55..71.12 rows=14 width=241) |
| -> Bitmap Heap Scan on dbi_technologies tech (cost=4.13..9.31 rows=14 width=19) |
| Recheck Cond: (cat_id = 10) |
| -> Bitmap Index Scan on "<14454>btree_dbicc_dbi_technologies_cat_id" (cost=0.00..4.13 rows=14 width=0) |
| Index Cond: (cat_id = 10) |
| -> Nested Loop (cost=0.41..4.41 rows=1 width=226) |
| -> Index Only Scan using idx_tech_res_level on dbi_techres techres (cost=0.28..4.02 rows=1 width=8) |
| Index Cond: ((tech_id = tech.id) AND (res_id = 60)) |
| -> Index Scan using dbi_techlevels_id_pkey on dbi_techlevels techlvl (cost=0.13..0.33 rows=1 width=222) |
| Index Cond: (id = techres.techlevel_id) |
+-------------------------------------------------------------------------------------------------------------------------+
- Bitmap Heap Scan and Bitmap Index Scan: Show how the hypothetical index is used in scanning the table.
- Cost Reduction: Reduced costs in the query plan indicate improved performance due to the hypothetical index.
postgres# SELECT * FROM hypopg_create_index('CREATE INDEX ON dbicc.dbi_techres USING btree (res_id)');
+------------+---------------------------------------+
| indexrelid | indexname |
|------------+---------------------------------------|
| 14455 | <14455>btree_dbicc_dbi_techres_res_id |
+------------+---------------------------------------+
SELECT 1
Time: 0.013s
postgres# EXPLAIN SELECT tech.id techid, tech.label techlabel, reslevel.techlevel techlevel, reslevel.techlvlid
FROM dbi_technologies tech
LEFT JOIN
( SELECT techlvl.label techlevel, techlvl.id techlvlid, techres.tech_id tech_id
FROM dbi_techres techres,
dbi_techlevels techlvl
WHERE techres.res_id = 60
AND techlvl.id = techres.techlevel_id) AS reslevel
ON tech.id = reslevel.tech_id
WHERE tech.cat_id = 10
ORDER BY techlabel;
+---------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
|---------------------------------------------------------------------------------------------------------------------------------------|
| Sort (cost=62.47..62.51 rows=14 width=241) |
| Sort Key: tech.label |
| -> Hash Right Join (cost=10.65..62.21 rows=14 width=241) |
| Hash Cond: (techres.tech_id = tech.id) |
| -> Hash Join (cost=1.17..52.55 rows=66 width=226) |
| Hash Cond: (techres.techlevel_id = techlvl.id) |
| -> Index Scan using "<14455>btree_dbicc_dbi_techres_res_id" on dbi_techres techres (cost=0.03..51.12 rows=66 width=8) |
| Index Cond: (res_id = 60) |
| -> Hash (cost=1.06..1.06 rows=6 width=222) |
| -> Seq Scan on dbi_techlevels techlvl (cost=0.00..1.06 rows=6 width=222) |
| -> Hash (cost=9.31..9.31 rows=14 width=19) |
| -> Bitmap Heap Scan on dbi_technologies tech (cost=4.13..9.31 rows=14 width=19) |
| Recheck Cond: (cat_id = 10) |
| -> Bitmap Index Scan on "<14454>btree_dbicc_dbi_technologies_cat_id" (cost=0.00..4.13 rows=14 width=0) |
| Index Cond: (cat_id = 10) |
+---------------------------------------------------------------------------------------------------------------------------------------+
- Index Scan on the hypothetical res_id index: Shows how the second hypothetical index changes the access path on dbi_techres, together with a switch to hash joins.
- Cost Reduction: The further reduced costs in the query plan indicate additional performance gains from the second hypothetical index.
After the creation of our HypoPG indexes, we can see some differences in our explain statements and notice some benefits:
Reduced Total Cost:
- Without Indexes: The total cost was estimated at around 71.83.
- With HypoPG Indexes: The total cost decreased to about 62.21.
Benefit: The overall cost of executing the query is lower, indicating improved efficiency.
Improved Join Strategies:
- Without Indexes: The query used nested loop joins, which can be less efficient for large datasets.
- With HypoPG Indexes: The query plan switched to hash joins, which are generally faster for joining large sets of data.
Benefit: Hash joins are more efficient and can handle larger datasets better than nested loops.
Better Index Usage:
- Without Indexes: The plan didn’t use any specific indexes for cat_id or res_id in the dbi_technologies table.
- With HypoPG Indexes: The plan utilized the hypothetical indexes for both cat_id and res_id.
Benefit: The use of these indexes speeds up data retrieval by directly accessing the relevant rows rather than scanning the entire table.
Change in Access Methods:
- Without Indexes: The plan involved a sequential scan and index-only scans.
- With HypoPG Indexes: The plan used bitmap heap scans and bitmap index scans.
Benefit: Bitmap scans are efficient for handling multiple rows and are typically faster when dealing with indexed columns.
4. Conclusion
Index Advisor and HypoPG are powerful tools in the PostgreSQL ecosystem, allowing DBAs to fine-tune their database performance with minimal disruption (and without overheating your brain). By recommending and simulating indexes, these extensions help you make decisions about index creation, ensuring your queries run efficiently.
Use these tools to analyze your queries, create hypothetical indexes, and measure their impact before making any permanent changes. I hope this blog helps you improve your PostgreSQL performance thanks to Index Advisor and HypoPG.
Happy optimizing!
Useful links
https://github.com/supabase/index_advisor
https://hypopg.readthedocs.io/en/rel1_stable/index.html
https://github.com/HypoPG/hypopg
Also feel free to check my previous blog about pgcli to make your life easier while using PostgreSQL: https://www.dbi-services.com/blog/simplifying-postgresql-management-a-guide-to-install-and-use-pgcli/
The article Enhancing PostgreSQL Performance with Index Advisor and HypoPG first appeared on dbi Blog.
OEM Cloud Control Release update to 13.5.0.23
This is the upgrade procedure for the release update from 07 Aug 2024, 13.5.0.23.
It is also one of the prerequisites for integrating OEM with the OCI Exadata Cloud@Customer solution, which is covered in my other post under this link:
The patching procedure and process is basically the regular one, using the standard OMSPatcher. We are updating from version 13.5.0.16 to 13.5.0.23.
This release update brings some improvements in the following areas:
New Management Packs for Advanced Management and Monitoring of Engineered Systems
To help address the key challenges in managing large distributed systems such as the Oracle Engineered Systems, RU23 introduces the Oracle Enterprise Manager Cloud Control Exadata Management Pack and Zero Data Loss Recovery Appliance (ZDLRA) Management Pack. These packs offer a comprehensive solution to optimize performance of your databases, maximize your Exadata capacity, and automate critical database backup operations.
The key features of the Exadata Management pack include:
- At scale Exadata infrastructure patching using Fleet Maintenance
- CPU allocation analysis with the Database Impact Advisor
- I/O load capacity evaluation with Exadata I/O Resource Management (IORM) Advisor
- Advanced, holistic IOPS allocations visualization for optimal Exascale performance monitoring
The key features of the ZDLRA pack include:
- Automated Database Fleet Backup Management for Recovery Appliance
- Database Fleet Advanced Backup Configuration for ZDLRA Using EM CLI
- Recovery Appliance Protected Database Fleet Archival Backup Scheduling
Source: https://blogs.oracle.com/observability/post/oem13c-ru23-is-now-available
The full list of new features can be found here:
So if you would like any of these new functionalities and better OCI integration with your legacy OEM, please first update the OMS to this version and then follow the post related to OCI integration.
Below is the upgrade procedure.
Have fun!
Step-by-step OMS update Procedure
Download the patch zip files:
In our case we will need patches:
p19999993_135000_Generic.zip omspatcher
p36494040_135000_Generic.zip OMS
p36494050_135000_Generic.zip OEM agent
plus 3 patches needed for JDBC:
1.Patch 35430934
2.Patch 31657681
3.Patch 34153238
p35430934_122140_Generic.zip
p31657681_191000_Generic.zip
p34153238_122140_Generic.zip
We copied patches to work directory oms1:/u02/inst/
In our case, we will need to back up 4 folders and enable a guaranteed restore point on the database:
#1. check OMS
oracle@oms1:/u02/inst/omspatcher/ [em1350] emctl status oms;
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation. All rights reserved.
WebTier is Up
Oracle Management Server is Up
JVMD Engine is Up
#2. stop OMS
oracle@oms1:/u02/inst/omspatcher/ [em1350] emctl stop oms -all
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation. All rights reserved.
Stopping Oracle Management Server...
WebTier Successfully Stopped
Oracle Management Server Successfully Stopped
AdminServer Successfully Stopped
Oracle Management Server is Down
JVMD Engine is Down
#3. Do backup of OMS + database (just set a guaranteed restore point, don't forget to remove it at the end of the upgrade)
OMS:
mkdir /u02/inst/bkp
cd /u01/app/oracle/
tar cvf - em13c | gzip -9 > /u02/inst/bkp/em13c_bkp.tar.gz
tar cvf - gc_inst | gzip -9 > /u02/inst/bkp/gc_inst_bkp.tar.gz
tar cvf - swlib | gzip -9 > /u02/inst/bkp/swlib_bkp.tar.gz
cd /u01/app/
tar cvf - oraInventory | gzip -9 > /u02/inst/bkp/oraInventory_bkp.tar.gz
#Where folders: em13c - OMS, gc_inst - WebLogic, swlib - software library
#4. Database guaranteed restore point:
SQL> create restore point EM13C_REPO_before_upgrade for pluggable database EM13C_REPO guarantee flashback database;
SQL> col name format a30;
SQL> select name , GUARANTEE_FLASHBACK_DATABASE,con_id from v$restore_point;
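As mentioned in the backup comment above, the guaranteed restore point must be removed once the upgrade has been validated. A possible cleanup statement, mirroring the create statement (run it only when you are sure you will not need to flash back):
SQL> drop restore point EM13C_REPO_before_upgrade for pluggable database EM13C_REPO;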
cd $ORACLE_HOME
mv OMSPatcher OMSPatcher_bkp
cp -rp /u02/inst/omspatcher/OMSPatcher /u01/app/oracle/em13c/
cd /u01/app/oracle/em13c/OMSPatcher/
./omspatcher version
OMSPatcher Version: 13.9.5.21.0
OPlan Version: 12.2.0.1.16
cd /u02/inst/oms/36494040/
emctl start oms;
/u01/app/oracle/em13c/OMSPatcher/omspatcher apply -analyze
OMSPatcher Automation Tool
Copyright (c) 2017, Oracle Corporation. All rights reserved.
OMSPatcher version : 13.9.5.21.0
OUI version : 13.9.4.0.0
Running from : /u01/app/oracle/em13c
Log file location : /u01/app/oracle/em13c/cfgtoollogs/omspatcher/opatch2024-10-01_15-52-13PM_1.log
WARNING:Apply the 12.2.1.4.0 version of the following JDBC Patch(es) on OMS Home before proceeding with patching.
1.MLR patch 35430934(or its superset),which includes bugs 32720458 and 33607709
2.Patch 31657681
3.Patch 34153238
OMSPatcher log file: /u01/app/oracle/em13c/cfgtoollogs/omspatcher/36494040/omspatcher_2024-10-01_15-52-21PM_analyze.log
Please enter OMS weblogic admin server URL(t3s://oms1:7102):>
Please enter OMS weblogic admin server username(oracle):>
Please enter OMS weblogic admin server password:>
Enter DB user name : sysman
Enter 'SYSMAN' password :
Checking if current repository database is a supported version
Current repository database version is supported
Prereq "checkComponents" for patch 36487738 passed.
Prereq "checkComponents" for patch 36329231 passed.
Prereq "checkComponents" for patch 36487989 passed.
Prereq "checkComponents" for patch 36487743 passed.
Prereq "checkComponents" for patch 36487747 passed.
Prereq "checkComponents" for patch 36161799 passed.
Prereq "checkComponents" for patch 36487829 passed.
Configuration Validation: Success
Running apply prerequisite checks for sub-patch(es) "36487738,36487747,36487989,36487829,36487743,36161799,36329231" and Oracle Home "/u01/app/oracle/em13c"...
Sub-patch(es) "36487738,36487747,36487989,36487829,36487743,36161799,36329231" are successfully analyzed for Oracle Home "/u01/app/oracle/em13c"
Complete Summary
================
All log file names referenced below can be accessed from the directory "/u01/app/oracle/em13c/cfgtoollogs/omspatcher/2024-10-01_15-52-13PM_SystemPatch_36494040_1"
Prerequisites analysis summary:
-------------------------------
The following sub-patch(es) are applicable:
Featureset Sub-patches Log file
---------- ----------- --------
oracle.sysman.top.oms 36487738,36487747,36487989,36487829,36487743,36161799,36329231 36487738,36487747,36487989,36487829,36487743,36161799,36329231_opatch2024-10-01_15-52-52PM_1.log
The following sub-patches are incompatible with components installed in the OMS system:
35582217,34430509,34706773,36487761,35854914,36329046,36487844,36329220
--------------------------------------------------------------------------------
The following warnings have occurred during OPatch execution:
1) Apply the 12.2.1.4.0 version of the following JDBC Patch(es) on OMS Home before proceeding with patching.
1.MLR patch 35430934(or its superset),which includes bugs 32720458 and 33607709
2.Patch 31657681
3.Patch 34153238
--------------------------------------------------------------------------------
Log file location: /u01/app/oracle/em13c/cfgtoollogs/omspatcher/36494040/omspatcher_2024-10-01_15-52-21PM_analyze.log
OMSPatcher succeeded.
#In our case we additionally install the 3 JDBC patches listed in the warning:
1.MLR patch 35430934(or its superset),which includes bugs 32720458 and 33607709
2.Patch 31657681
3.Patch 34153238
oracle@oms1:/u02/inst/jdbc_patches/ [em1350] ls -al
total 1908
drwxr-xr-x. 2 oracle oinstall 114 Oct 1 16:20 .
drwxrwxrwx. 7 oracle oinstall 79 Oct 1 16:19 ..
-rw-r--r--. 1 oracle oinstall 468428 Oct 1 14:15 p31657681_191000_Generic.zip
-rw-r--r--. 1 oracle oinstall 58787 Oct 1 14:15 p34153238_122140_Generic.zip
-rw-r--r--. 1 oracle oinstall 1418389 Oct 1 14:15 p35430934_122140_Generic.zip
unzip p35430934_122140_Generic.zip
unzip p31657681_191000_Generic.zip
unzip p34153238_122140_Generic.zip
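# Apply each of the three JDBC patches with OPatch while the OMS is stopped: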
1.
emctl stop oms
cd /u02/inst/jdbc_patches/35430934/
/u01/app/oracle/em13c/OPatch/opatch apply
2.
cd /u02/inst/jdbc_patches/34153238
/u01/app/oracle/em13c/OPatch/opatch apply
3.
cd /u02/inst/jdbc_patches/31657681/
/u01/app/oracle/em13c/OPatch/opatch apply
oracle@oms1:/u02/inst/oms/36494040/ [em1350] /u01/app/oracle/em13c/OMSPatcher/omspatcher apply -analyze
OMSPatcher Automation Tool
Copyright (c) 2017, Oracle Corporation. All rights reserved.
OMSPatcher version : 13.9.5.21.0
OUI version : 13.9.4.0.0
Running from : /u01/app/oracle/em13c
Log file location : /u01/app/oracle/em13c/cfgtoollogs/omspatcher/opatch2024-10-01_16-32-12PM_1.log
OMSPatcher log file: /u01/app/oracle/em13c/cfgtoollogs/omspatcher/36494040/omspatcher_2024-10-01_16-32-21PM_analyze.log
Please enter OMS weblogic admin server URL(t3s://oms1:7102):>
Please enter OMS weblogic admin server username(oracle):>
Please enter OMS weblogic admin server password:>
Enter DB user name : sysman
Enter 'SYSMAN' password :
Checking if current repository database is a supported version
Current repository database version is supported
Prereq "checkComponents" for patch 36487738 passed.
Prereq "checkComponents" for patch 36329231 passed.
Prereq "checkComponents" for patch 36487989 passed.
Prereq "checkComponents" for patch 36487743 passed.
Prereq "checkComponents" for patch 36487747 passed.
Prereq "checkComponents" for patch 36161799 passed.
Prereq "checkComponents" for patch 36487829 passed.
Configuration Validation: Success
Running apply prerequisite checks for sub-patch(es) "36487738,36487747,36487989,36487829,36487743,36161799,36329231" and Oracle Home "/u01/app/oracle/em13c"...
Sub-patch(es) "36487738,36487747,36487989,36487829,36487743,36161799,36329231" are successfully analyzed for Oracle Home "/u01/app/oracle/em13c"
Complete Summary
================
All log file names referenced below can be accessed from the directory "/u01/app/oracle/em13c/cfgtoollogs/omspatcher/2024-10-01_16-32-12PM_SystemPatch_36494040_1"
Prerequisites analysis summary:
-------------------------------
The following sub-patch(es) are applicable:
Featureset Sub-patches Log file
---------- ----------- --------
oracle.sysman.top.oms 36487738,36487747,36487989,36487829,36487743,36161799,36329231 36487738,36487747,36487989,36487829,36487743,36161799,36329231_opatch2024-10-01_16-32-43PM_1.log
The following sub-patches are incompatible with components installed in the OMS system:
35582217,34430509,34706773,36487761,35854914,36329046,36487844,36329220
Log file location: /u01/app/oracle/em13c/cfgtoollogs/omspatcher/36494040/omspatcher_2024-10-01_16-32-21PM_analyze.log
OMSPatcher succeeded
Now the output contains no warnings, so it is safe to proceed with the patching.
Update OMS
emctl stop oms;
cd /u02/inst/oms/36494040/
/u01/app/oracle/em13c/OMSPatcher/omspatcher apply
oracle@oms1:/u02/inst/oms/36494040/ [em1350] /u01/app/oracle/em13c/OMSPatcher/omspatcher apply
OMSPatcher Automation Tool
Copyright (c) 2017, Oracle Corporation. All rights reserved.
OMSPatcher version : 13.9.5.21.0
OUI version : 13.9.4.0.0
Running from : /u01/app/oracle/em13c
Log file location : /u01/app/oracle/em13c/cfgtoollogs/omspatcher/opatch2024-10-01_16-48-53PM_1.log
OMSPatcher log file: /u01/app/oracle/em13c/cfgtoollogs/omspatcher/36494040/omspatcher_2024-10-01_16-49-02PM_apply.log
Please enter OMS weblogic admin server URL(t3s://oms1:7102):>
Please enter OMS weblogic admin server username(oracle):>
Please enter OMS weblogic admin server password:>
Enter DB user name : sys
Enter 'sys' password :
Checking if current repository database is a supported version
Current repository database version is supported
Prereq "checkComponents" for patch 36487738 passed.
Prereq "checkComponents" for patch 36329231 passed.
Prereq "checkComponents" for patch 36487989 passed.
Prereq "checkComponents" for patch 36487743 passed.
Prereq "checkComponents" for patch 36487747 passed.
Prereq "checkComponents" for patch 36161799 passed.
Prereq "checkComponents" for patch 36487829 passed.
Configuration Validation: Success
Running apply prerequisite checks for sub-patch(es) "36487738,36487747,36487989,36487829,36487743,36161799,36329231" and Oracle Home "/u01/app/oracle/em13c"...
Sub-patch(es) "36487738,36487747,36487989,36487829,36487743,36161799,36329231" are successfully analyzed for Oracle Home "/u01/app/oracle/em13c"
To continue, OMSPatcher will do the following:
[Patch and deploy artifacts] : Apply sub-patch(es) [ 36161799 36329231 36487738 36487743 36487747 36487829 36487989 ]
Apply RCU artifact with patch "/u01/app/oracle/em13c/.omspatcher_storage/36487738_Aug_3_2024_00_57_51/original_patch";
Apply RCU artifact with patch "/u01/app/oracle/em13c/.omspatcher_storage/36329231_May_8_2024_02_03_34/original_patch";
Apply RCU artifact with patch "/u01/app/oracle/em13c/.omspatcher_storage/36487989_Aug_3_2024_02_10_51/original_patch";
...
...
...
emctl_register_gccompliance_2024-10-01_17-34-40PM.log
MRS-SecurityClassManager emctl_register_SecurityClassManager_2024-10-01_17-43-22PM.log
MRS-OutOfBoxRole emctl_register_OutOfBoxRole_2024-10-01_17-43-27PM.log
MRS-report emctl_register_report_2024-10-01_17-43-32PM.log
MRS-namedsql emctl_register_namedsql_2024-10-01_17-43-36PM.log
MRS-runbooks emctl_register_runbooks_2024-10-01_17-43-41PM.log
MRS-derivedAssocs emctl_register_derivedAssocs_2024-10-01_17-43-45PM.log
MRS-derivedAssocs emctl_register_derivedAssocs_2024-10-01_17-43-52PM.log
MRS-derivedAssocs emctl_register_derivedAssocs_2024-10-01_17-43-56PM.log
Log file location: /u01/app/oracle/em13c/cfgtoollogs/omspatcher/36494040/omspatcher_2024-10-01_16-49-02PM_apply.log
OMSPatcher succeeded
#Execute Sync EM:
emctl sync;
#Restart the OMS (just to check that everything is OK):
emctl stop oms -all
emctl start oms
Oracle Enterprise Manager Cloud Control 13c Release 5
Copyright (c) 1996, 2021 Oracle Corporation. All rights reserved.
Starting Oracle Management Server...
WebTier Successfully Started
Oracle Management Server Successfully Started
Oracle Management Server is Up
JVMD Engine is Up
Log back in to the EM console, click 'About Enterprise Manager' in the user drop-down menu and check the version.
Go to Manage Cloud Control > Health Overview and check the overall OMS status.
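As an additional check from the command line (a minimal sketch, assuming the same OMS home as above; the exact output depends on your environment), you can display the OMS status details and list the patches registered in the OMS home:
emctl status oms -details
/u01/app/oracle/em13c/OMSPatcher/omspatcher lspatches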
If all the above operations were successful, you can now remove the database restore point.
Removing the restore point
#Remove the guaranteed restore point on the OMS repository database:
SQL> drop restore point EM13C_REPO_before_upgrade for pluggable database EM13C_REPO;
SQL> col name format a30;
SQL> select name , GUARANTEE_FLASHBACK_DATABASE,con_id from v$restore_point;
Now we can update the OEM Agents.
There are many options to update/patch OEM Agents; please refer to the Oracle documentation to find the one most suitable for your environment. In this tutorial we will use the automated 'Provisioning and Patching' feature of OEM.
Go to Enterprise > Provisioning and Patching > Saved Patches and upload the OEM agent patch p36494050_135000_Generic.zip.
After the patch has been uploaded successfully, click the patch number and create a Deployment Plan.
Name the plan and choose the targets to update:
Once the plan is created, go to Enterprise > Provisioning and Patching > Patches & Updates and choose your plan.
Then analyze the plan and deploy it; you will be able to follow the details in the Jobs view.
After the deployment finishes, the agents should be updated and restarted on all nodes you chose during plan creation.
If you have many agents, some of them may run into issues during the upgrade. Usually a manual agent restart helps; if not, each case needs to be investigated separately. In the worst case, you can also simply redeploy the agent on the affected host with a new agent image.
That’s it!
We updated the OMS and the OEM Agents to 13.5.0.23.
If you are interested in using the new Exadata Cloud@Customer integration features of OEM, the link below points to my post on how to do this:
The article OEM Cloud Control Release update to 13.5.0.23 appeared first on the dbi Blog.
M-Files to unify your data
Two weeks ago, I was at #SITF24 to talk about the integration of external repositories (SharePoint/Teams, shared folders, …) into M-Files. Surprisingly, this is still a tricky topic for companies.
Having worked as an ECM consultant for years, I'm probably a bit too naive and assume documents are already well managed in most firms. The feedback I received proved that the reality is different.
What is the need?
Based on a study conducted by Gartner in 2023, people who work with data (producing documents and so on) have to deal with an average of 11 applications, and this number increases for people working in an IT department.
This situation automatically leads to problems like:
- Confusion: where should I store my document, and where is the previous version?
- Duplicates: a document is recreated because it was not stored at the right place.
- Compliance: a new version of a document is created, but not from the latest version, because the collaborator was not able to identify the current version of the document.
- Security: a document is not stored at the right place, so the right permissions are not applied.
Luckily, these cases do not usually all apply at the same time, but they are common issues companies are facing.
How can M-Files help?
M-Files has several interesting features to help users.
The idea is to make M-Files the entry point for the user: as soon as someone in the company needs some data, the reflex is to go to M-Files.
Prior to that, the administrator has configured some external repositories to connect other applications: network shares, SharePoint sites/document libraries, MS Teams channels, but also Google Workspace, Amazon S3…
You may also want to connect other applications to gather data that is not document-based, like Salesforce, SAP, Microsoft Dynamics CRM… or even a simple database.
You can find all the existing connectors here. If you aren't able to find what you need, you can create your own.
Once the connections are made, M-Files starts indexing the data. After some time (depending on the amount of data to index, of course), users will be able to search in M-Files and find data stored outside.
That's already good, but M-Files can do more.
Intelligent Metadata Layer
In previous posts (like here) I already talked about IML, the AI capabilities of M-Files.
This time I will introduce another module, M-Files Discovery.
With this module, the M-Files magic appears!
M-Files Discovery can automatically classify and categorize documents stored on external repositories.
What does this mean? Simply that no matter where the documents are stored, if M-Files is able to classify them, they will appear in the M-Files views.
Then if a workflow is defined for this class of document, it can follow the different steps exactly like a native M-Files document.
Discovery can also detect Personally Identifiable Information; it can be really interesting to define specific scenarios for that case and ensure compliance.
Additionally, with the M-Files duplicate detection you can easily identify a document stored in multiple repositories (cf. the previous screenshot, where “duck-invoice.pdf” is stored in 3 different repositories).
As usual, if you have any further questions, feel free to reach out to us, and don't hesitate to tell me if you want more technical information; I will be happy to write another blog on that topic.
The article M-Files to unify your data appeared first on the dbi Blog.
Developing a Business Central Custom API for Power BI Integration – Resolving the ‘Could Not Download Symbols’ Error
Dear Readers,
This article marks my first blog post. I am both excited and eager to share my experience of tackling a challenging issue encountered while developing a custom API for Microsoft Business Central to enable data extraction for Power BI reports. I hope this article will be informative and help others avoid the pitfalls I faced.
Context
As part of the Business Central online implementation, I was tasked with enhancing the reporting solution by integrating additional Power BI reports. The process involved the following steps:
- Extract data from Business Central.
- Build Power BI reports.
- Publish reports to the Power BI Service.
Being new to Business Central, my first task was to explore how to connect it with Power BI. After some research, I found that there are two primary methods to expose data from Business Central for use in Power BI reports:
- API Pages or Queries: A method for exposing data through a structured interface.
- Open Data Protocol (OData) Web Services: Allows data to be accessed via a standardized protocol.
Microsoft strongly recommends using REST API web services for integration with Power BI, as highlighted in the following articles:
- Choosing Whether to Use API Pages or OData Web Services
- Compare REST APIs, SOAP, and OData Web Services
Based on this, I decided to use API Pages, as they provide a consistent and reliable interface for integrating with external applications.
To meet specific business needs, I had to implement custom APIs: although the standard APIs cover common processes and entities such as customers, vendors, and sales orders, they often lack support for custom fields or more complex business logic. Custom APIs, in contrast, provide a flexible integration solution, allowing developers to build APIs tailored to unique workflows and specialized needs.
Problem Statement
My task was to develop a custom API to extract Business Central data into Power BI. I quickly set up my development environment using Visual Studio Code and the AL Language extension for Microsoft Dynamics 365 Business Central. The necessary downloads can be accessed via the following links:
- Download Visual Studio Code.
- Download the AL Language extension for Microsoft Dynamics 365 Business Central.
Confident that everything was ready, I began developing the API through the following steps:
Phase 1: Sign up to Business Central
Phase 2: Create a new AL project in VS Code with target platform 13.0
Project contents
File launch.json
File app.json
Phase 3: Download symbols
Downloading symbols is crucial for accessing existing functionality, ensuring proper compilation, and using development tools effectively in Business Central extension development.
I executed the command to download the symbols via Ctrl + Shift + P then AL: Download symbols
Then the following error messages occurred:
Despite various attempts to resolve the problem, such as installing the Azure Account extension, clearing the Visual Studio Code cache, and creating new projects with different target platform versions, the issue persisted.
Root Cause Analysis
Upon investigating the root cause, I discovered that the problem lay in the configuration settings of the development environment. Specifically, the environmentName parameter in the launch.json file was incorrectly configured.
The development environment failed to properly connect to Business Central, which prevented the download of the necessary symbols. This issue can be attributed to incorrect URLs or missing settings that blocked the symbol download process.
Solution
The download symbols process relies on the current configuration in the launch.json file. Therefore, I modified the environmentName parameter in launch.json to reference the correct Business Central instance, as demonstrated below:
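Since the original screenshot is not reproduced here, below is a minimal sketch of what the relevant configuration in launch.json looks like for a cloud sandbox; the environmentName value ("MySandbox") and the startup object are placeholders that must be replaced with the values of your own Business Central environment:
{
    "version": "0.2.0",
    "configurations": [
        {
            "type": "al",
            "request": "launch",
            "name": "Microsoft cloud sandbox",
            "environmentType": "Sandbox",
            "environmentName": "MySandbox",
            "startupObjectId": 22,
            "startupObjectType": "Page",
            "breakOnError": true
        }
    ]
}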
These details can be obtained from the Business Central session that I initiated during the first phase of the development process. After making that update, I successfully downloaded the symbols, which resolved the error and displayed the following confirmation message:
This allowed me to proceed with the development of the custom API.
ConclusionResolving the “Could not download symbols” error in Business Central requires careful configuration of the development environment and proper permission settings. By following the steps outlined above, you can avoid this issue and ensure smooth integration between Business Central and Power BI.
Stay tuned for more insights in my upcoming posts, where I will delve into implementing a custom calendar feature that integrates data from Business Central and enhances logic within Power BI.
The article Developing a Business Central Custom API for Power BI Integration – Resolving the ‘Could Not Download Symbols’ Error appeared first on the dbi Blog.
SQL Konferenz 2024 – Microsoft Copilots
Today, together with Roger Schönmann, I followed the first day of SQL Konferenz 2024. We are attending this event from October 01 to 02 in Hanau.
At dbi services, we are no strangers to this conference, as we have been sending members of our team for several years now.
We consider the sessions to be of high quality and right on trend, and this year is no exception.
If you would like to know the details of the sessions given on this first day, you can find them in this Agenda.
For my part, here is a list of the sessions I attended today:
- Taming the Datamonster with intelligent AI using SQL and Microsoft
- Here’s the execution plan … now what?
- Thwarting Disaster for SQL Server on Azure VM
- Query Store: The simplest way to Troubleshooting Performance
- ACID in SQL Server: what it is, how it works, and how to live more adventurously
- Welcome to the age of SQL Copilots
There were also some more Data Engineer-oriented sessions.
In this blog, I have chosen to focus on Bob Ward’s session on Copilot. Indeed, I feel that this is an attractive topic to focus on.
Copilots are Microsoft's AI-driven assistants designed to support and assist you in various IT tasks.
You will be able to meet them in a variety of environments and contexts:
- Microsoft Copilot (website and standalone application)
- Microsoft Copilot for Azure (embedded in Azure – In Preview)
- Microsoft Copilot for Microsoft 365 (integrated with all apps and documents)
- Microsoft Copilot for Security (a generative AI-powered security assistant)
- Copilot in Windows (integrated directly into the Windows OS)
- Copilot in Fabric (In Preview)
- GitHub Copilot (code development)
- Copilot in Azure SQL Database (assistance and T-SQL code generation – In Preview)
As you can see, Copilot AI is flooding into Microsoft products at every level. In time, these tools will become part of our daily work.
Among this variety of Copilot offerings, Bob Ward focused his presentation on Microsoft Copilot for Azure and Copilot for Azure SQL Database.
According to Microsoft’s vision, these Copilots aim to:
- Improve efficiency in several areas: performance, security, availability…
- Design and optimize: schemas, T-SQL code…
- Simplify: database management, reduce complexity of SQL…
Before going further into this session, it is important to remember that we interact with AI through prompts. These are instructions or questions that guide Copilot to produce responses or perform specific tasks. Bob Ward shared a non-exhaustive list of prompts to ask in the context of a performance problem:
Later on, to explain Copilot’s potential, several demos were presented for different use cases:
- A current performance issue is occurring at database level
- Data must be transformed in a specific manner
- Retrieve the top 20 consuming queries in the past 4 days
- …
In the demo below, “my database is slow”, Copilot was proactively able to determine that the high CPU utilization was a relevant piece of information to share.
In addition, it returned all the CPU metrics and the query responsible for this high usage. Of course, the performance problem was quite obvious in this scenario; usually, a monitoring tool has already informed you of excessive CPU usage and which query is to blame.
In fact, the added value is that you get it interactively and with a detailed report.
What would be interesting is Copilot’s behavior in a more complex case, where the performance problem is not obvious.
In addition, here are some considerations when using Copilot for SQL Database in Azure:
- Use the database blade context when using prompts to improve efficiency
- Be aware of the tasks you are asking for: some may take much longer than others
- Copilot only has access to the resources we have access to; it cannot perform analysis or actions if we do not have the necessary permissions
AI-driven assistants have undeniable potential and much to contribute to our day-to-day work. We are clearly in the early stages of using them, and perspectives are exciting.
There is great potential for efficiency gains and simplification in many IT projects, but we must not fall into the trap of trying to apply them at any cost to something that is already effective.
The article SQL Konferenz 2024 – Microsoft Copilots appeared first on the dbi Blog.
Future Data Driven 2024 – introduction to vector databases
I had the opportunity to participate in the Future Data Driven 2024 event (https://datadrivencommunity.com/FutureDataDriven2024.html) with my colleague Stéphane Haby.
Many speakers were present, and I was able to attend several sessions covering the following topics: Azure, AI, CosmosDB, Co-Pilot, Database Watcher. In several sessions, when AI was discussed, we talked about vector databases.
These were terms I had heard before but never really understood. So, I would like to present how vector databases are used.
Before talking about vector databases, it is necessary to cover certain concepts, which are as follows:
- Vectors
- Vector embeddings
- Cosine similarity
General Information
Vectors are mathematical objects that consist of both a magnitude (or length) and a direction. We can graphically represent a vector (in a 2-dimensional plane) as follows:
A 2-dimensional vector can be written as follows:
In our case, we can write it as:
We also find another notation to describe vectors, which is as follows:
Vectors are multidimensional, and their notation can be as follows for a 4-dimensional vector:
A first problem arises here. Indeed, when multiple vectors are described, this initial form of notation can be confusing. In this case, we can introduce a new notation:
x and y are the components of vector a.
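The notations themselves appear as images in the original post; as a sketch of the standard forms being described (the concrete values from the figure are not reproduced here), a 2-dimensional vector a can be written in column form or in component form, and the component form extends naturally to higher dimensions (here 4):
\[ \vec{a} = \begin{pmatrix} x \\ y \end{pmatrix} = (x, y), \qquad \vec{b} = (b_1, b_2, b_3, b_4) \]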
Length or Magnitude of a Vector:
We mentioned earlier that one of the properties of a vector is its length or norm. This is calculated as follows for a vector:
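The formula is shown as an image in the original post; for a generic 2-dimensional vector a = (x, y) it reads:
\[ \|\vec{a}\| = \sqrt{x^2 + y^2} \]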
In the case of the vector described above, this results in:
Thus, we can see that this formula corresponds to the calculation of the hypotenuse. The length of the vector is the value of the hypotenuse.
Vector embeddings
Vector embeddings allow us to represent data (texts, images, etc.) in the form of vectors. One way to describe images using a relational database system would be to use several tables:
- Image: contains the images
- Tag: contains the words that describe the images
- Description: associates an image with the words (the tags) that describe it
By using certain SQL operators, it should be possible to find images that are described as similar (because they share the same tags).
Vector embeddings represent data differently. Several techniques exist for this, including:
- Word2Vec: A method that uses neural networks to create vector representations of words.
- GloVe (Global Vectors for Word Representation): A technique that uses co-occurrence statistics to create embeddings.
- BERT (Bidirectional Encoder Representations from Transformers): An advanced model that generates contextual embeddings, taking into account surrounding words.
Thus, vector databases use vector embeddings. Once these embeddings are generated, they are stored in vector databases. Then, it is possible to use cosine similarity to evaluate how similar the vectors are. Finally, vector databases index the vectors, and there are several methods:
- Tree-based indexing
- Approximate Nearest Neighbors
- Quantization
Cosine similarity allows for a precise evaluation of the similarity between two vectors. Here is an example:
- Imagine that I have two recipe cards. I would like to determine if these two recipes are similar based on the ingredients mentioned in them.
- So, I have two recipes: R1 and R2
- R1: in this recipe, the word “egg” is mentioned 6 times and the word “flour” is mentioned 1 time.
- R2: in this recipe, the word “egg” is mentioned 1 time and the word “flour” is mentioned 8 times.
Based on this information, it is possible to graphically represent how many times these words appear.
The first vector represents my first recipe. The second vector represents my second recipe.
Cosine similarity will determine whether these two recipes are similar (this is a very simplified example). To do this, we calculate the cosine of the angle formed between the two vectors on the graph; with the word counts above, this angle is roughly 73 degrees, and its cosine is about 0.29.
Here are more details on how to calculate cosine similarity:
It is possible to represent recipes R1 and R2 by two vectors, which are as follows:
So we have:
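The detailed formulas are shown as images in the original post; a sketch of the computation with the word counts given above, R1 = (6, 1) and R2 = (1, 8), is:
\[ \cos\theta = \frac{\vec{r_1} \cdot \vec{r_2}}{\|\vec{r_1}\| \, \|\vec{r_2}\|} = \frac{6 \cdot 1 + 1 \cdot 8}{\sqrt{6^2 + 1^2} \, \sqrt{1^2 + 8^2}} = \frac{14}{\sqrt{37} \, \sqrt{65}} \approx 0.29 \]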
The values of cosine similarity can range between -1 and 1. Here, this value indicates that the two recipes are not very similar.
Pinecone, vector databases, Python implementation
Pinecone is a vector database. It is designed to store, manage, and search vectors, enabling similarity search operations and analyses based on unstructured data.
Here is an example, adapted from the documentation, that illustrates how to use a vector database and calculate similarities between vectors:
Code :
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
import time  # needed for the readiness wait loop below

pc = Pinecone(api_key='MyAPIKey')

index_name = "example-index"

# Create a 2-dimensional index that uses cosine similarity as its distance metric
pc.create_index(
    name=index_name,
    dimension=2,
    metric="cosine",  # we use the cosine similarity
    spec=ServerlessSpec(
        cloud='aws',
        region='us-east-1'
    )
)

# Wait until the index is ready before using it
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

index = pc.Index(index_name)

# Insert the two recipe vectors
index.upsert(
    vectors=[
        {"id": "vec1", "values": [6, 1]},  # we define our first vector (R1)
        {"id": "vec2", "values": [1, 8]},  # we define our second vector (R2)
    ],
    namespace="example-namespace1"
)

# Query the index with a new vector and return the 3 closest matches
query_results1 = index.query(
    namespace="example-namespace1",
    vector=[1, 7],
    top_k=3,
    include_values=True
)
print(query_results1)
Here is the result:
We had as input a vector with the value of [1, 7]. This vector was compared to the other two vectors, and the cosine similarity (the score) was determined.
If you encounter any errors while running this script, it is possible that your index already exists or has not yet been created.
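As a final sanity check of the scores above, here is a small, self-contained sketch (no Pinecone account required) that recomputes the cosine similarity locally; it gives a score close to 1 for vec2 = [1, 8] and a much lower one for vec1 = [6, 1]:
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1, 7]
print(cosine_similarity(query, [6, 1]))  # ~0.30, not very similar
print(cosine_similarity(query, [1, 8]))  # ~1.00, almost identical direction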
The article Future Data Driven 2024 – introduction to vector databases appeared first on the dbi Blog.