CloudSQL with PostgreSQL very slow performance
I wanted to migrate from BigQuery to Cloud SQL to save cost. My problem is that Cloud SQL with PostgreSQL is very slow compared to BigQuery: a query that takes 1.5 seconds in BigQuery takes almost 4.5 minutes(!) on Cloud SQL with PostgreSQL.

I have a Cloud SQL PostgreSQL server with the following configuration:

[image: instance configuration]

My database has a main table with 16M rows (around 14 GB in RAM).

An example query:

EXPLAIN ANALYZE
SELECT "title"
FROM public.videos
WHERE EXISTS (SELECT *
              FROM (SELECT COUNT(DISTINCT CASE WHEN LOWER(param) LIKE '%thriller%' THEN '0'
                                               WHEN LOWER(param) LIKE '%crime%'    THEN '1' END) AS count
                    FROM UNNEST(categories) AS param) alias
              WHERE count = 2)
ORDER BY views DESC
LIMIT 12 OFFSET 0

The table is a videos table with a categories column of type text[]. The search condition requires that the array contains both a category matching '%thriller%' and a category matching '%crime%' (the distinct count of the two CASE markers must be exactly 2).
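For reference, the same intent can be written as two independent EXISTS checks over the unnested array. This is a sketch, not the original query: it differs in the edge case where a single element matches both patterns (the CASE version counts such an element only as the thriller marker):

-- Require one element matching 'thriller' and one matching 'crime'.
SELECT "title"
FROM public.videos
WHERE EXISTS (SELECT 1 FROM UNNEST(categories) AS param
              WHERE LOWER(param) LIKE '%thriller%')
  AND EXISTS (SELECT 1 FROM UNNEST(categories) AS param
              WHERE LOWER(param) LIKE '%crime%')
ORDER BY views DESC
LIMIT 12 OFFSET 0;

Either way, every row still unnests and scans its array, so this mostly helps readability rather than runtime.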

The EXPLAIN ANALYZE of this query gives this output (CSV): link. The EXPLAIN (BUFFERS) of this query gives this output (CSV): link.

Query Insights graph: [image]

Memory profile: [image]

BigQuery reference for the same query on the same table size: [image]

Server config: link.

Table definition: link.

My goal is to get the same query speed on Cloud SQL as on BigQuery.

Nurserymaid answered 10/11, 2021 at 22:49 Comment(10)
Please show us the table definition, including the indexes. Is there any chance you could normalize your categories?Outstrip
Sorry, I can't show it due to IP issues. No indexes.Nurserymaid
Does this answer your question? https://mcmap.net/q/575076/-bigquery-vs-cloud-sql-for-dashboards-backendTajo
You should turn on track_io_timing, then run EXPLAIN (ANALYZE, BUFFERS) and post the results as properly formatted text (not as images), preserving indentation. Does BigQuery also have something equivalent to the execution plan you could show?Keesee
@Tajo thanks for the link. My database is not that big, only 16M rows. I agree that BQ is suited for really big tables, but I don't think that's my caseNurserymaid
@Keesee I will post more details. Yes, BQ has an execution plan; I can show it. But how will it help you?Nurserymaid
My main problem, as I see it, is that I can't tune the server up because Google does not allow itNurserymaid
How to tune the db: cloud.google.com/sql/docs/postgres/flags#consoleNurserymaid
All new data is now posted: table definition, EXPLAIN (ANALYZE, BUFFERS)Nurserymaid
@Nurserymaid On your "No indexes." statement: Do you mean there are none, but you do consider adding some if they would get you closer to BigQuery performance, or that they are not allowed in your setup? Also, are you free to change the table structure?Gapin
For anyone coming here wondering how to tune their Postgres machine on Cloud SQL: they call it flags, and you can do it from the UI, although not all the config options are editable.

https://cloud.google.com/sql/docs/postgres/flags#console
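Once a flag is applied and the instance has restarted, the running value can be confirmed from any SQL session (standard PostgreSQL, nothing Cloud SQL specific):

-- Show the current value of a single setting.
SHOW work_mem;

-- Or list several settings at once, with where each value came from.
SELECT name, setting, source
FROM pg_settings
WHERE name IN ('work_mem', 'max_parallel_workers_per_gather');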

Nurserymaid answered 12/11, 2021 at 0:35 Comment(0)
The initial query looks overcomplicated. It could be rewritten as:

SELECT  v."title"
FROM public.videos v
WHERE array_to_string(v.categories, '^') ILIKE ALL (ARRAY['%thriller%', '%crime%'])
ORDER BY views DESC 
LIMIT 12 OFFSET 0;

db<>fiddle demo
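If substring matching must be kept, one option (not part of this answer; a sketch assuming the pg_trgm extension is available, which Cloud SQL lists among its supported extensions) is a trigram index over the joined string, so the ILIKE predicates become indexable. array_to_string is not IMMUTABLE, so it needs a wrapper; all names here are illustrative:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Immutable wrapper so the expression can be indexed; the separator
-- must match the one used in queries.
CREATE OR REPLACE FUNCTION immutable_array_to_string(text[]) RETURNS text AS $$
  SELECT array_to_string($1, '^');
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX videos_cats_trgm_idx ON public.videos
  USING gin (immutable_array_to_string(categories) gin_trgm_ops);

-- Spelled as two ANDed ILIKE predicates (rather than ILIKE ALL) so the
-- planner can turn each one into an index condition; the WHERE expression
-- must match the index expression exactly.
SELECT v."title"
FROM public.videos v
WHERE immutable_array_to_string(v.categories) ILIKE '%thriller%'
  AND immutable_array_to_string(v.categories) ILIKE '%crime%'
ORDER BY v.views DESC
LIMIT 12 OFFSET 0;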

Muscadel answered 14/11, 2021 at 11:15 Comment(1)
Thanks for the answer. I have tried your solution on my db and it's still very slow: 2.5 minutes to finish. This is the EXPLAIN ANALYZE: ufile.io/ahk9qoem I want it to run in under 2 seconds like in BigQueryNurserymaid
PostgreSQL is slow by design on every query involving the COUNT aggregate function, and there is absolutely nothing to do about it except use a materialized view to recover the performance.

The tests I have made on my 48-core machine comparing COUNT performance between PostgreSQL and MS SQL Server are clear: SQL Server is between 61 and 561 times faster in all situations, and with a columnstore index SQL Server can be 1,533 times faster…

The same speed is reached with any other RDBMS. The explanation is clearly PG's MVCC, which keeps ghost rows inside table and index pages and therefore has to visit every row to know whether it is an active or a ghost row... In all the other RDBMSs, the count is done by reading a single piece of information at the top of the page (the number of rows in the page), by using parallelized access, or in SQL Server by batch access rather than row access...

There is nothing to be done to speed up COUNT in PG until the storage engine is entirely rewritten to avoid ghost slots inside pages...
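Taking the answer's own suggestion, a materialized view that precomputes the expensive category match would look roughly like this (a sketch; names are illustrative, and the view trades freshness for speed since it must be refreshed explicitly):

-- Precompute the matching rows once instead of scanning 16M rows per query.
CREATE MATERIALIZED VIEW videos_thriller_crime AS
SELECT title, views
FROM public.videos
WHERE EXISTS (SELECT 1 FROM UNNEST(categories) p WHERE LOWER(p) LIKE '%thriller%')
  AND EXISTS (SELECT 1 FROM UNNEST(categories) p WHERE LOWER(p) LIKE '%crime%');

CREATE INDEX ON videos_thriller_crime (views DESC);

-- Serve reads from the view; refresh on whatever schedule staleness allows.
REFRESH MATERIALIZED VIEW videos_thriller_crime;
SELECT title FROM videos_thriller_crime ORDER BY views DESC LIMIT 12;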

Volney answered 16/11, 2021 at 15:54 Comment(2)
Thanks for the answer, I tried MySQL and it's still very slow. I was using a JSON object because there is no array type in MySQLNurserymaid
Arrays and array-like structures (JSON, XML, records, multisets...) run counter to the basic principles of the relational model, which require data to be atomic. When data is atomic, it can be sped up with indexes. Except in specific databases like Oracle or Microsoft SQL Server that can index XML, no other RDBMS knows how to speed up access with indexes on composite structures like ARRAY! By the way, a table is very close to an array and can be indexed...Volney
I believe you need to use full-text search and a GIN index. The steps:

  1. Create the helper function for the index:

     CREATE OR REPLACE FUNCTION immutable_array_to_string(text[]) RETURNS text AS $$
       SELECT array_to_string($1, ',');
     $$ LANGUAGE sql IMMUTABLE;

  2. Create the index itself (to_tsvector already lower-cases its input, so the LOWER call is dropped to keep the expression identical to the one used in queries):

     CREATE INDEX videos_cats_fts_idx ON videos
       USING gin (to_tsvector('english', immutable_array_to_string(categories)));

Use the following query:

SELECT title
FROM videos
WHERE to_tsvector('english', immutable_array_to_string(categories))
      @@ to_tsquery('english', 'thriller & crime')
LIMIT 12 OFFSET 0;

Be aware that this query has a different meaning for 'crime' and 'thriller': they are no longer just substrings but tokens in English text. It looks like that is actually better for your task, though. Also, this index is not good for frequently changed data; it should work fine when your data is mostly read-only.
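A quick way to sanity-check that the planner actually picks up the GIN index (echoing the EXPLAIN (ANALYZE, BUFFERS) suggestion from the question's comments):

-- The WHERE expression must match the index expression exactly;
-- otherwise the planner falls back to a sequential scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT title
FROM videos
WHERE to_tsvector('english', immutable_array_to_string(categories))
      @@ to_tsquery('english', 'thriller & crime')
LIMIT 12 OFFSET 0;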

PS This answer is inspired by answer & comments: https://mcmap.net/q/745822/-postgres-full-text-search-on-array-column

Gargantua answered 27/11, 2021 at 13:17 Comment(0)
Apart from the SQL syntax optimization, have you tried tuning PostgreSQL?

Checking the plan, I found only two parallel workers and just 25 kB of memory used for sorting:

Workers Planned: 2
Sort Method: quicksort  Memory: 25kB

Your query is a typical OLAP query. Its performance is usually tied to memory and the number of CPU cores used (workers). By default Postgres uses KB-level work memory and few workers. You can tune your postgresql.conf so it behaves like an OLAP-type database.

Here is my suggestion: use more memory (about 9 MB of work_mem) and more CPUs (up to 16):

# DB Version: 13
# OS Type: linux
# DB Type: dw
# Total Memory (RAM): 24 GB
# CPUs num: 16
# Data Storage: ssd

max_connections = 40
shared_buffers = 6GB
effective_cache_size = 18GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 9830kB
min_wal_size = 4GB
max_wal_size = 16GB
max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
max_parallel_maintenance_workers = 4

You can append these lines to the end of your postgresql.conf and restart your PostgreSQL server for them to take effect.

For further optimization:

  1. Reduce connections and increase work_mem. 200 × 9830 kB is about 2 GB of memory across all connections. With fewer connections (for example, 100), each query can get more working memory; see the sketch below.
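On Cloud SQL, where postgresql.conf cannot be edited directly, work_mem can also be raised per session rather than globally (a sketch; the value is illustrative):

-- Raise work_mem only for the current session, run the heavy query, then reset.
SET work_mem = '256MB';
-- ... run the reporting query here ...
RESET work_mem;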

Regarding the text array type and unnest: you can try adding a proper index.

That's all, good luck.

WangYong

Linn answered 28/11, 2021 at 4:6 Comment(1)
The question is about PostgreSQL on Google Cloud SQL - there is no direct access to postgresql.conf. effective_io_concurrency, max_parallel_workers_per_gather, max_parallel_workers, max_parallel_maintenance_workers settings are not made available through instance customisation flags.Gapin
You can modify the performance parameter settings under the Flags section. Ex: [image]

Cite answered 17/4 at 13:19 Comment(0)
