Set up Postgres replication #160
Reference: kosmos/chef#160
@greg already started working on this, but there was no issue for it yet
As discussed in #147, it makes sense to first upgrade the current PostgreSQL server on andromeda to version 12, since the replication configuration has been improved (and simplified). In version 10 the replication settings had to be added to a config file (`recovery.conf`) in the data directory; since version 12 they live in `postgresql.conf` with the rest of the config.

I have followed this tutorial to set up replication between two VMs: https://www.percona.com/blog/2019/10/11/how-to-set-up-streaming-replication-in-postgresql-12/
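For reference, a minimal sketch of what the version-12 replica config looks like (the host and credentials below are placeholders, not our actual values):

```ini
# postgresql.conf on the replica (PostgreSQL 12+); recovery.conf is gone.
# Placeholder connection values:
primary_conninfo = 'host=primary.example.org port=5432 user=replicator password=changeme sslmode=require'

# In addition, an empty standby.signal file in the data directory tells
# the server to start as a replica (this replaces the old standby_mode).
```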
The new setup will run PostgreSQL with TLS enabled, using certs generated by Let's Encrypt, so that client connections and replication are secured.

In order to secure the servers, we will use firewall rules that only allow connections from the IPs of the servers that need them. We will also use client authentication to limit each server's access to its specific databases, and to only allow the replica to connect to the replication database.
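As a sketch of the client-authentication side, the `pg_hba.conf` entries would look something like this (IPs, databases, and users are placeholders):

```ini
# pg_hba.conf — placeholder addresses, databases, and users
# TYPE   DATABASE     USER        ADDRESS        METHOD
hostssl  mastodon     mastodon    10.0.0.5/32    md5
hostssl  gitea        gitea       10.0.0.6/32    md5
# The replica may only open replication connections:
hostssl  replication  replicator  10.0.0.7/32    md5
```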
Now I need to remove some hardcoded values from the new recipes, using instead values from the Chef nodes living in the repo's `nodes` directory.
I think we can start with the upgrade to 12 on andromeda and switch to the new config, maybe on Tuesday? After this first step, which will require a short downtime, we can install PostgreSQL on the new Hetzner server and confirm that replication works; this will not require any downtime.
This took more work than I originally thought to get right, but I'm now satisfied with the Chef resource I wrote to set up a PostgreSQL primary or replica, with a TLS cert provided by Let's Encrypt.
Now I'm going to perform the PostgreSQL upgrade in a VM and make sure everything goes smoothly
Why would we need Let's Encrypt tho? We don't want any random public clients to access it.
I ran into a Chef bug that appears to be caused by the implementation of custom resources. Notifications are broken, so adding an access rule using the `postgresql_access` resource does not reload the service: https://github.com/sous-chefs/postgresql/issues/648

The custom resource I wrote was really useful, since so much code is identical between a primary and a replica, but I might have to rewrite it another way (move the access rules to the recipe, for example). With this bug, the PostgreSQL service needs to be reloaded manually after adding a new rule.
I created a repo to reproduce the issue; a maintainer will take a look at it.
re: TLS, it's still useful between clients and servers even with firewall rules and access rules
I didn't ask why we need TLS, but why we would need DNS and Let's Encrypt. It seems like you're proposing to use both in addition to TLS.
(Also, TLS is only needed when we use a public network, and adds overhead as well.)
There is an open PR now, and the question of why we need DNS and LE is still left unanswered by it. I think a normal X.509 cert, based on a Kosmos root CA that all machines can use, would have been much easier to begin with, and that it also makes more sense overall.
With DNS, we have to change hostnames instead of IPs, which requires waiting for the cache to update. And LE isn't required at all when there are no public clients using DNS. Also, the cert is then tied to that hostname, while our own cert would be valid on any IP/machine.
TLS with Let's Encrypt is easier to set up than a self-signed cert on all machines. In my experience, generating self-signed certs is a pain in the ass.
If the primary PostgreSQL server is, for example, andromeda.kosmos.org with TLS, changing it to another server in the client configs is the same amount of work as changing an IP: you change the FQDN.
How is LE easier than running literally a single command on any POSIX system with OpenSSL installed? Storing it in a data bag is as easy.
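For what it's worth, the single command in question could look like this (the subject, key size, and validity period are placeholder choices, not an agreed-upon convention):

```shell
# Generate a self-signed X.509 cert and key in one shot (placeholder values).
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout server.key -out server.crt \
  -days 3650 -subj "/CN=kosmos-postgres"

# Inspect the result:
openssl x509 -in server.crt -noout -subject -dates
```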
That completely disregards my criticism of using DNS here. And if you change the entire hostname, then it's obviously more work on both ends (plus the DNS entry itself) than simply changing the IP only. Not even speaking of naming and such in that case.
... also not mentioning short expiration dates and frequent renewals, of course. None of that is necessary with a connection between private servers, where the only point is to encrypt the connection between the particular, internal machines.
I have pushed changes to my PR (#163) to switch over to a self-signed certificate. In VMs I got it to work, including with psql's `verify-ca` option.

In my experiments I could not get ejabberd's `sql_ssl_cafile` and `sql_ssl_verify` to ever work. At first I thought it was because my cert was a wildcard cert, but I got the same issue with a certificate generated for an example domain (`primary.kosmos.org` with an `/etc/hosts` entry) too. I am always getting this error in ejabberd, so we will have to do without cert validation, or create an issue:

`TLS client: In state certify at ssl_handshake.erl:1372 generated CLIENT ALERT: Fatal - Handshake Failure - {bad_cert,hostname_check_failed}`

I also noticed that ejabberd does not support connecting to PostgreSQL servers with only TLS v1.3 enabled, so I have configured it to allow a minimum of v1.2 (0063776297/site-cookbooks/kosmos-postgresql/resources/server.rb (L90))
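For context, the relevant server-side settings are along these lines (file paths are placeholders; the actual values live in the Chef resource):

```ini
# postgresql.conf TLS settings (placeholder paths)
ssl = on
ssl_cert_file = '/etc/postgresql/server.crt'
ssl_key_file = '/etc/postgresql/server.key'
# ejabberd cannot connect to a TLS v1.3-only server,
# so allow v1.2 as the minimum:
ssl_min_protocol_version = 'TLSv1.2'
```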
Sounds to me like this is all much more complicated than just setting up a private network, to be honest.
But in any case, does that mean ejabberd comes with its own CA list/store, or why would it ignore a Kosmos CA cert from the system store (e.g. when you just use `update-ca-certificates`)?

In ejabberd you pass it a path to the CA root certs (https://docs.ejabberd.im/admin/configuration/toplevel/#sql-ssl-cafile). This part works; it fails when verifying the hostname.
What hostname? Why would we use hostnames?
... also, aside from not needing a hostname, why would we have to verify the cert on every connection? Someone who could replace the IP could also replace the local cert used.
Good implementations of PostgreSQL clients give you the option of verifying that the cert was signed by the root cert: `verify-ca`. For example, Gitea calls it `verify-ca`, like the PostgreSQL documentation: https://docs.gitea.io/en-us/config-cheat-sheet/#database-database

Since we decided to use IPs, we wouldn't want to check the hostname (or use it), but ejabberd has no option for `verify-ca`, so ejabberd wouldn't verify anything.

See https://www.postgresql.org/docs/12/libpq-ssl.html#LIBPQ-SSL-PROTECTION for the security implications of each option.
I'm not sure I follow. The point of verifying a hostname is that the IP can change. That's the problem with DNS.
Thus, the question is still open: why would we have to verify the cert on the client in this scenario? It is a simple local config. And it is only there to encrypt the connection, so that someone outside of our machines cannot record unencrypted traffic, when it's flowing through a public network. Nothing else.
So the options for this use case are:

1. Set up a private network between the machines and skip TLS entirely.
2. Encrypt the traffic with TLS, without the clients verifying the certificate.

The easiest way to achieve number 2 is with simple X.509 certificates that we ourselves configure on the machines. No?
If we only want to encrypt the traffic and do not care about the root cert's validity, then there's no need to create an issue for ejabberd, or to wait for a `verify-ca` feature.

I think the PR is good to go. Right now Mastodon and ejabberd connect to PostgreSQL using localhost, so the ejabberd cookbook will only need to be updated to create a cluster with the new server.
Replication is now running on centaurus. I had to fix the firewall rules to make it work: #167. The data dir for PostgreSQL is also encrypted using encfs: #166. Both are ready to merge
Splendid!