Postgres errors on ARM-based M1 Mac w/ Big Sur
Ever since I got a new ARM-based M1 MacBook Pro, I've been experiencing severe and consistent PostgreSQL issues (psql 13.1). Whether I use a Rails server or Foreman, I receive errors in both my browser and terminal like PG::InternalError: ERROR: could not read block 15 in file "base/147456/148555": Bad address or PG::Error (invalid encoding name: unicode) or Error during failsafe response: PG::UnableToSend: no connection to the server. The strange thing is that I can often refresh the browser repeatedly in order to get things to work (until they inevitably don't again).

I'm aware of all the configuration challenges related to ARM-based M1 Macs, which is why I've uninstalled and reinstalled everything from Homebrew to Postgres multiple times in numerous ways (with Rosetta, without Rosetta, using arch -x86_64 brew commands, using the Postgres app instead of the Homebrew install). I've encountered a couple other people on random message boards who are experiencing the same issue (also on new Macs) and not having any luck, which is why I'm reluctant to believe that it's a drive corruption issue. (I've also run the Disk Utility FirstAid check multiple times; it says everything's healthy, but I have no idea how reliable that is.)

I'm using thoughtbot parity to sync up my dev environment database with what's currently in production. When I run development restore production, I get hundreds of lines in my terminal that look like the output below (this is immediately after the download completes but before it goes on to create defaults, process data, sequence sets, etc.). I believe it's at the root of the issue, but I'm not sure what the solution would be:

pg_restore: dropping TABLE [table name1]
pg_restore: from TOC entry 442; 1259 15829269 TABLE [table name1] u1oi0d2o8cha8f
pg_restore: error: could not execute query: ERROR:  table "[table name1]" does not exist
Command was: DROP TABLE "public"."[table name1]";
pg_restore: dropping TABLE [table name2]
pg_restore: from TOC entry 277; 1259 16955 TABLE [table name2] u1oi0d2o8cha8f
pg_restore: error: could not execute query: ERROR:  table "[table name2]" does not exist
Command was: DROP TABLE "public"."[table name2]";
pg_restore: dropping TABLE [table name3]
pg_restore: from TOC entry 463; 1259 15830702 TABLE [table name3] u1oi0d2o8cha8f
pg_restore: error: could not execute query: ERROR:  table "[table name3]" does not exist
Command was: DROP TABLE "public"."[table name3]";
pg_restore: dropping TABLE [table name4]
pg_restore: from TOC entry 445; 1259 15830421 TABLE [table name4] u1oi0d2o8cha8f
pg_restore: error: could not execute query: ERROR:  table "[table name4]" does not exist
Command was: DROP TABLE "public"."[table name4]";
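For what it's worth, errors of that shape are usually benign: parity's restore wraps pg_restore with --clean, which issues DROP TABLE for every object in the dump, and the "does not exist" errors just mean the local database doesn't have those tables yet. A sketch of running the restore manually (not parity's exact internals; the database and file names are placeholders) shows the flag that silences them:

```shell
# Sketch with placeholder names (myapp_development, latest.dump are not from
# the question). --clean drops objects before recreating them; adding
# --if-exists turns drops of missing tables into silent no-ops.
RESTORE_FLAGS="--clean --if-exists --no-owner --no-acl"
if command -v pg_restore >/dev/null 2>&1 && pg_isready -q 2>/dev/null; then
  # Only runs when postgres client tools and a live server are available.
  pg_restore $RESTORE_FLAGS --dbname=myapp_development latest.dump
else
  echo "pg_restore would run with: $RESTORE_FLAGS"
fi
```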

Has anyone else experienced this? Any solution ideas would be much appreciated. Thanks!

EDIT: I was able to reproduce the same issue on an older MacBook Pro (also running Big Sur), so it seems unrelated to M1 but potentially related to Big Sur.

Redstone answered 12/1, 2021 at 16:14 Comment(14)
"Bad address" is the message associated with EFAULT, meaning that postgres is passing an invalid pointer to read() or a similar system call. That indicates either a quite low-level bug in postgres or some associated library, or else something like memory corruption due to bad hardware, overheating, etc. – Jackinthepulpit
Thanks, Nate. Are there any ways to reliably diagnose whether it's a hardware issue or a low-level Postgres (or associated library) bug? – Redstone
I have the same problem with my ARM MacBook Air, but not consistently. If I restart Postgres, the error does not pop up for a while. It's most likely a problem that will persist until there is a native ARM version. – Polystyrene
Yup, having the same problem on my M1 Mac Mini. I installed the native version of Postgres via Homebrew. – Movable
Same problem with my M1 Mini. Native and x86. Oddly enough, restarting Postgres does seem to make the error go away sometimes. – Layby
Are you all running Big Sur? I'm starting to wonder if it's an OS issue rather than an ARM/M1 issue. – Redstone
Having this exact same issue... – Hy
Have you searched Postgres bug reports, or considered submitting one? The multiple reports do make hardware seem unlikely, but the only way to really prove it's a Postgres bug (versus an OS bug or something else) is to debug Postgres and determine whether it's doing the right thing or not. – Jackinthepulpit
I don't think it's a Big Sur issue. I came from an Intel MacBook Pro 15" that ran Big Sur for two months without this issue. As for the solution, this might be a little premature, but I deleted the data directory created by Homebrew (/opt/homebrew/var/postgresql@11) and then ran initdb again, and I haven't seen the issue since. It's been almost two days now. 🤞 – Hy
Interesting, mnylen... When you ran initdb again, did you recreate your data directory in the same place/path that Homebrew had created it, or did you put it somewhere else? Nate – I've searched Postgres bug reports but haven't found anything there. As for submitting a bug report, I was admittedly a little intimidated by their bug reporting guidelines. – Redstone
@carlhauck I did use the same data directory. I renamed the original postgresql@11 folder to postgresql.bak and then ran initdb again, specifying the exact same path as the original data directory. However, I changed the --locale option from the default, though I don't see how that would affect things. I can't remember the exact command line I used, but initdb --help should be pretty self-explanatory. I also used my own user when running initdb. Of course, this will delete all data in your cluster, so make sure to use pg_dump or something to make a backup before deleting the old data directory. – Hy
Still going without errors after rerunning initdb. – Hy
@Hy Thanks for the update. I tried your initdb solution a couple of days ago, but the same issues cropped up again immediately. – Redstone
I also had this problem, even on freshly created databases. Using pg_restore with the -j parameter seemed to exacerbate the issue (even on a freshly created db). I ran reindexdb template1 and it seems to have gone away... for now! Good luck tracking it. – Lemuroid

Definitive workaround for this:

After trying all the workarounds in the other answer, I was STILL getting this error occasionally, even after dumping and restoring the database, switching to M1-native postgres, running all manner of maintenance scripts, etc.

After much tinkering with postgresql.conf, the only thing that has reliably worked around this issue indefinitely (I have not received the error since) is:

In postgresql.conf, change:

max_worker_processes = 8

to

max_worker_processes = 1

After making this change, I have thrown every test at my previously error-ridden database and it hasn't displayed the same error once. Previously, an extraction routine I run on a database of about 20M records would give the Bad address error after processing 1-2 million records. Now it completes the whole process.

Obviously there is a performance penalty to reducing the number of parallel workers, but this is the only way I've found to reliably and permanently resolve this issue.
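A possible middle ground (my assumption, not something tested in this answer): keep max_worker_processes at its default but disable parallel query, since parallel query workers are what fan reads out across processes, while other background workers (autovacuum launcher, etc.) are left alone:

```
# postgresql.conf – hypothetical, untested variant of the workaround above:
# keep background workers but stop parallel query execution, which is what
# spawns the per-query worker processes implicated in the bad reads.
max_worker_processes = 8                # leave at default
max_parallel_workers_per_gather = 0     # no parallel workers per query
max_parallel_maintenance_workers = 0    # no parallel CREATE INDEX / VACUUM
```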

Geronimo answered 18/2, 2021 at 17:58 Comment(2)
Good to know, but it seems more like a workaround than a fix. It sounds like postgres has some race bug that is avoided if the race has only one competitor :) – Jackinthepulpit
Thanks Nate -- adjusted the answer, as it's definitely a workaround rather than a fix, given the performance penalty. FWIW, I was led toward reducing parallel workers by some "could not map dynamic shared memory segment" errors in the psql log. There is clearly something wrong with memory access and postgres parallel workers on M1 Macs -- it's the only thing about this machine that's frustrating. – Geronimo

UPDATE #2:

The WAL buffer and related adjustments extended the time between errors but didn't eliminate them completely. I ended up reinstalling a fresh Apple Silicon version of Postgres using Homebrew, then doing a pg_dump of my existing database (the one experiencing the errors) and restoring it to the new installation/cluster.

Here's the interesting bit: pg_restore failed to restore one of the indexes in the database, and noted it during the restore process (which otherwise completed). My hunch is that corruption or another issue with this index was causing the Bad Address errors. As such, my final suggestion on this issue is to perform a pg_dump, then use pg_restore to restore the database. pg_restore flagged this issue where pg_dump didn't, writing a clean DB sans the faulty index.
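The dump-and-restore route described above can be sketched as follows (database and file names are placeholders, not from this answer; the custom format is what pg_restore reads, and pg_restore reports per-object errors, such as a bad index, while continuing with the rest of the restore):

```shell
# Placeholders throughout: mydb, mydb_fresh, and mydb.dump are illustrative.
DUMP_CMD="pg_dump --format=custom --file=mydb.dump mydb"
RESTORE_CMD="pg_restore --dbname=mydb_fresh --verbose mydb.dump"
if command -v pg_dump >/dev/null 2>&1 && pg_isready -q 2>/dev/null; then
  # Only runs when postgres client tools and a live server are available.
  $DUMP_CMD && $RESTORE_CMD
else
  echo "would run: $DUMP_CMD, then: $RESTORE_CMD"
fi
```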

UPDATE:

Continued to experience this issue after attempting several workarounds, including a full pg_dump and restore of the affected database. And while some of the fixes seem to extend the time between occurrences (particularly increasing shared buffer memory), none have proven a permanent fix.

That said, some more digging on postgres mailing lists revealed that this "Bad Address" error can occur in conjunction with WAL (write-ahead-log) issues. As such, I've now set the following in my postgresql.conf file, significantly increasing the WAL buffer size:

wal_buffers = 4MB

and have not experienced the issue since (knock on wood, again).

It makes sense that this would have some effect: by default, wal_buffers scales in proportion to shared_buffers (and, as mentioned above, increasing the shared buffer size provided temporary relief). Anyway, something else to try until we get definitive word on what's causing this bug.
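For reference, the two memory settings touched in this answer could sit together in postgresql.conf like this (values are the ones reported in this thread; both require a server restart to take effect):

```
# postgresql.conf – settings adjusted in this answer's workarounds
shared_buffers = 512MB   # stock default is typically 128MB
wal_buffers = 4MB        # default of -1 means shared_buffers/32, capped at 16MB
```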


Was having this exact issue sporadically on an M1 MacBook Air: ERROR: could not read block and Bad Address in various permutations.

I read in a postgres forum that this issue can occur in virtual machine setups. As such, I assume this is somehow caused by Rosetta. Even if you're using the Universal version of postgres, you're likely still using an x86 binary for some adjunct process (e.g. Python in my case).

Regardless, here's what has solved the issue (so far): reindexing the database

Note: you need to reindex from the command line, not using SQL commands. When I attempted to reindex using SQL, I encountered the same Bad Address error over and over, and the reindexing never completed.

When I reindexed using the command line, the process finished, and the Bad Address error has not recurred (knock on wood).

For me, it was just:

reindexdb name_of_database

Took 20-30 minutes for a 12GB DB. Not only am I not getting these errors anymore, but the database seems snappier to boot. I only hope the issue doesn't return with repeated reads/writes/index creation under Rosetta. I'm not sure why this works... maybe indices created on M1 Macs are prone to corruption? Maybe the indices become corrupted on write or access because of the Rosetta interaction?

Geronimo answered 2/2, 2021 at 2:53 Comment(7)
Thanks for this, @Ben Wilson. Unfortunately, when I tried reindexing, I got the following: "reindexdb: error: processing of database "database_name" failed: ERROR: could not read block 22 in file "base/16384/16600": Bad address"... I tried running it many times, and the block number just kept increasing. I then restarted postgres and tried again, and it appeared to do something (there was no feedback from the terminal denoting that anything had occurred, which can sometimes be good news). However, after a bit of navigating around the app on localhost:3000, I started seeing the errors again. – Redstone
@carlhauck -- hmmm -- this is the exact same situation I was experiencing with the SQL statements, e.g. REINDEX DATABASE content_agg. However, using the command-line version (reindexdb NAME_OF_DB), for whatever reason, worked and allowed the reindex to complete. One thing to try: shut down your psql instance, restart it, and make sure there are no other connections to the DB (from Python, a SQL GUI, a web server, etc.), then try running reindexdb from the command line. – Geronimo
I tried again multiple times with both the command line and SQL statements, ensuring that there were no other connections to the DB. Still getting the "bad address" errors. – Redstone
One more thing to try: increasing your shared buffer memory. After I performed the reindex, I was able to avoid this issue for a few days, but it eventually recurred (though not as frequently). This (and this thread: postgresql.org/message-id/…) made me think there may be a memory issue at play. Change the shared_buffers line in your postgresql.conf file. Mine was set to 128MB; I set it to 512MB and the issue hasn't recurred yet. You can find your conf file with: psql -U postgres -c 'SHOW config_file' – Geronimo
Thanks for following up, @Ben -- I really appreciate it. I tried increasing shared_buffers to 512MB, but that didn't change anything either... What DID change things for me was when my employer sent me an Intel-based MacBook yesterday. It arrived with Catalina installed (I haven't updated to Big Sur yet), and everything's running smoothly in Postgres with the same DB that was giving me problems before. Sorry -- I know that doesn't solve things for everyone else who's experiencing the issue :/ – Redstone
One more update for anyone else experiencing this... the issue cropped up AGAIN after the shared_buffers increase. Clearly something postgres-related degrades over time on M1 Macs (the issue doesn't occur for a while after restarting the server). Latest try: export the database as a dump (pg_dump db_name > ~/db_name.dump), then reimport it (pg_restore). Since doing this, the issue hasn't recurred (fingers crossed). – Geronimo
It is somehow related to load, because I don't get this issue during regular development, but it does happen during the automated test suite (which runs in parallel and recreates the db every second or so for a longer period) -- so it's load-related, or vacuum, or something; not sure yet. – Scoundrel

Is it possible that something in the Big Sur Beta 11.3 fixed this issue?

I've been having the same issues as OP since installing PostgreSQL 13 using MacPorts on my Mac mini M1 (now on PostgreSQL 13.2).

I would see could not read block errors:

  1. Occasionally when running ad hoc queries
  2. Always when compiling a book in R Markdown that makes several queries
  3. Always when running VACUUM FULL on my main database (there's about 620 GB in the instance on this machine and the error would be thrown very quickly relative to how long a VACUUM FULL would take).

(My "fix" so far has been to point my Mac to the Ubuntu server I have running in the corner of my office, so no real problem for me.)

But I've managed to do 2 and 3 without the error since upgrading to Big Sur Beta 11.3 today (both failed immediately prior to upgrading). Is it possible that something in the OS fixed this issue?

Stringer answered 23/2, 2021 at 16:54 Comment(3)
I have used my database more intensively since the update above and I haven't seen this issue again (it was really unusable before). – Stringer
FWIW, I'm still experiencing this issue with max_worker_processes set to 8 after the latest Big Sur update on a MacBook Air M1. – Geronimo
I literally went from seeing these issues every time I tried a particular task (#2 in my list, meaning I simply pointed to my Ubuntu PostgreSQL) to never seeing the issue (so I now just use my Mac mini M1 instance). I checked, and I too have max_worker_processes set to 8. So the mystery continues. (I use MacPorts PostgreSQL and my data directory lives on a separate hard drive.) – Stringer

I restored postgresql.conf from postgresql.conf.sample (and restarted the db server), and it has worked fine since then.

To be clear, I had tried both the wal_buffers and max_worker_processes changes here and they didn't help. I discovered this accidentally: I had tried so many things that I just needed to go back to a baseline. I did not reinitialize the whole database or anything like that, just the config file.
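A sketch of that reset, with hedges: the data directory path below is an assumption (the Homebrew Apple Silicon default), and the pristine sample file is taken from the install's share directory as reported by pg_config. Adjust PGDATA for your setup.

```shell
# Assumptions: PGDATA is a placeholder path; postgresql.conf.sample ships in
# the share directory reported by pg_config. The old config is backed up first.
PGDATA="${PGDATA:-/opt/homebrew/var/postgres}"
if command -v pg_config >/dev/null 2>&1 && [ -d "$PGDATA" ]; then
  SAMPLE="$(pg_config --sharedir)/postgresql.conf.sample"
  cp "$PGDATA/postgresql.conf" "$PGDATA/postgresql.conf.bak"  # keep a backup
  cp "$SAMPLE" "$PGDATA/postgresql.conf"                      # stock config
  pg_ctl -D "$PGDATA" restart
else
  echo "pg_config or data directory not found; set PGDATA for your install"
fi
```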

Hemlock answered 8/3, 2021 at 10:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.