Set up Postgres replication #160

Closed
opened 2020-04-25 10:13:17 +00:00 by raucao · 20 comments
Owner

@greg already started working on this, but there was no issue for it yet

greg was assigned by raucao 2020-04-25 10:13:17 +00:00
raucao added a new dependency 2020-04-25 10:13:57 +00:00
Owner

As discussed in #147 it makes sense to first upgrade the current PostgreSQL server on andromeda to version 12, since the replication configuration has been improved (and simplified). In version 10 the replication settings had to be added to a config file in the data directory; since version 12 they live in `postgresql.conf` with the rest of the config.
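For reference, a minimal sketch of the version 12 layout (the address and user are placeholders):

```
# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 10

# postgresql.conf on the replica; with version 10 this had to live in
# recovery.conf inside the data directory
primary_conninfo = 'host=203.0.113.10 port=5432 user=replication'
```

On the replica, an empty `standby.signal` file in the data directory is what tells PostgreSQL 12 to run as a standby, replacing the old `recovery.conf` mechanism.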

I have followed this tutorial to set up replication between two VMs: https://www.percona.com/blog/2019/10/11/how-to-set-up-streaming-replication-in-postgresql-12/

The new setup will have PostgreSQL run with TLS enabled, with certs generated by Let's Encrypt, so that client connections and replication are secured.

In order to secure the servers, we will use a firewall rule that only allows connections from the IPs of the servers that need it. We will also use [client authentication](https://www.postgresql.org/docs/12/auth-pg-hba-conf.html) to limit access to specific databases for each server, and to allow the replica server access to the replication database.
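For illustration, such rules in `pg_hba.conf` could look roughly like this (databases, users, and addresses are placeholders):

```
# TYPE   DATABASE     USER         ADDRESS          METHOD
hostssl  mastodon     mastodon     203.0.113.20/32  md5
hostssl  ejabberd     ejabberd     203.0.113.21/32  md5
# "replication" is a keyword that matches streaming replication connections
hostssl  replication  replication  203.0.113.30/32  md5
```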

Now I need to remove some hardcoded values from the new recipes, using instead values from the Chef nodes living in the repo's `nodes` directory.

I think we can start with the upgrade to 12 on andromeda and switch to the new config, maybe on Tuesday? After this first step, which will require a short downtime, we can install PostgreSQL on the new Hetzner server and confirm replication works; this will not require downtime.
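For the upgrade step itself, assuming a Debian/Ubuntu install managed by postgresql-common (an assumption on my part), it should boil down to:

```
# migrates the old 10/main cluster to a new 12/main cluster
sudo pg_upgradecluster -v 12 10 main
```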

Owner

This took more work than I originally thought to get right, but I'm now satisfied with the Chef resource I wrote to set up a PostgreSQL primary or replica, with a TLS cert provided by Let's Encrypt.

Now I'm going to perform the PostgreSQL upgrade in a VM and make sure everything goes smoothly

Author
Owner

Why would we need Let's Encrypt tho? We don't want any random public clients to access it.

Owner

I ran into a Chef bug that appears to be caused by the implementation of custom resources. Notifications are broken, so adding an access rule using the `postgresql_access` resource does not reload the service: https://github.com/sous-chefs/postgresql/issues/648

The custom resource I wrote was really useful, since so much code is identical between a primary and a replica, but I might have to rewrite it another way (move the access rules to the recipe, for example). With this bug, the PostgreSQL service has to be reloaded manually after adding a new rule.

I created a repo to reproduce the issue; a maintainer will take a look at it.
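A sketch of that recipe-level workaround, so the notification actually fires (property names follow the sous-chefs cookbook; the address is a placeholder):

```ruby
# Declared in the recipe rather than inside the custom resource,
# working around sous-chefs/postgresql#648
postgresql_access 'replication' do
  access_type   'hostssl'
  access_db     'replication'
  access_user   'replication'
  access_addr   '203.0.113.30/32' # placeholder replica IP
  access_method 'md5'
  notifies :reload, 'service[postgresql]', :delayed
end

service 'postgresql' do
  supports reload: true
  action :nothing
end
```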

Owner

re: TLS, it's still useful between clients and servers even with firewall rules and access rules

Author
Owner

I didn't ask why we need TLS, but why we would need DNS and Let's Encrypt. It seems like you're proposing to use both in addition to TLS.

(Also, TLS is only needed when we use a public network, and adds overhead as well.)

Author
Owner

There is an open PR now, and the question why we need DNS and LE is still left unanswered by it. I think a normal X509 cert, based on a Kosmos root CA that all machines can use, would have been much easier to begin with, but also that it makes more sense overall.

With DNS, we have to change hostnames instead of IPs, which requires waiting for the cache to update. And LE isn't required at all when there are no public clients using DNS. Also, the cert is then tied to that hostname, while our own cert would be valid on any IP/machine.

Owner

TLS with Let's Encrypt is easier to set up than a self-signed cert on all machines. In my experience generating self-signed certs is a pain in the ass.

If the primary PostgreSQL server is, for example, andromeda.kosmos.org with TLS, changing it to another server in the client configs is the same amount of work as changing an IP (you change the FQDN instead).

Author
Owner

> TLS with Let's Encrypt is easier to set up than a self-signed cert on all machines. In my experience generating self-signed certs is a pain in the ass.

How is LE easier than running literally a single command on any POSIX system with OpenSSL installed? Storing it in a data bag is as easy.
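For instance, something along these lines (filenames and lifetime are arbitrary):

```
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 \
  -subj "/CN=kosmos-postgresql" \
  -keyout server.key -out server.crt
```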

> If the primary PostgreSQL server is, for example, andromeda.kosmos.org with TLS, changing it to another server in the client configs is the same amount of work as changing an IP (you change the FQDN instead).

That completely disregards my criticism of using DNS here. And if you change the entire hostname, then it's obviously more work on both ends (plus the DNS entry itself) than simply changing the IP only. Not even speaking of naming and such in that case.

Author
Owner

... also not mentioning short expiration dates and frequent renewals, of course. None of that is necessary with a connection between private servers, where the only point is to encrypt the connection between the particular, internal machines.

Owner

I have pushed changes to my PR (#163) to switch over to a self-signed certificate. In VMs I got it to work, including with psql's [`verify-ca`](https://www.postgresql.org/docs/12/libpq-ssl.html) option.

In my experiments I could not get ejabberd's `sql_ssl_cafile` and [`sql_ssl_verify`](https://docs.ejabberd.im/admin/configuration/toplevel/#sql-ssl-verify) to ever work. At first I thought it was because my cert was a wildcard cert, but I got the same issue with a certificate generated for an example domain (`primary.kosmos.org` with an `/etc/hosts` entry) too.

I am always getting this error in ejabberd, so we will have to do without cert validation, or create an issue: `TLS client: In state certify at ssl_handshake.erl:1372 generated CLIENT ALERT: Fatal - Handshake Failure - {bad_cert,hostname_check_failed}`.

I also noticed that ejabberd does not support connecting to PostgreSQL servers with only TLS v1.3 enabled, so I have configured the server to allow a minimum of v1.2 (https://gitea.kosmos.org/kosmos/chef/src/commit/006377629734ed96019321f67961022f607ed1a5/site-cookbooks/kosmos-postgresql/resources/server.rb#L90)
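For reference, the relevant ejabberd options look roughly like this (host and paths are placeholders; option names are from the ejabberd docs linked above):

```yaml
# ejabberd.yml (sketch)
sql_type: pgsql
sql_server: "203.0.113.10"
sql_ssl: true
sql_ssl_cafile: "/etc/ssl/certs/kosmos-ca.pem"
# sql_ssl_verify: true   # the option that fails with hostname_check_failed
```

And the TLS floor on the PostgreSQL 12 side is the `ssl_min_protocol_version` setting in `postgresql.conf`:

```
ssl_min_protocol_version = 'TLSv1.2'
```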

Author
Owner

Sounds to me like this is all much more complicated than just setting up a private network to be honest.

But in any case, does that mean ejabberd comes with its own CA list/store, or why would it ignore a Kosmos CA cert from the system store (like when you just [use `update-ca-certificates`](https://unix.stackexchange.com/questions/90450/adding-a-self-signed-certificate-to-the-trusted-list/132163#132163) e.g.)?

Owner

In ejabberd you pass it a path to the CA root certs (https://docs.ejabberd.im/admin/configuration/toplevel/#sql-ssl-cafile). This part works; it's the hostname verification that fails.

Author
Owner

> it fails verifying the hostname

What hostname? Why would we use hostnames?

Author
Owner

... also, aside from not needing a hostname, why would we have to verify the cert on every connection? Someone who could replace the IP could also replace the local cert used.

Owner

Good implementations of PostgreSQL clients give you the option of verifying that the cert was signed by the root cert: `verify-ca`. For example, Gitea calls it `verify-ca` like the PostgreSQL documentation does: https://docs.gitea.io/en-us/config-cheat-sheet/#database-database

Since we decided to use IPs, we wouldn't want to check the hostname (or use it at all), but ejabberd has no option for `verify-ca`, so ejabberd wouldn't verify anything.

See https://www.postgresql.org/docs/12/libpq-ssl.html#LIBPQ-SSL-PROTECTION for the security implications of every option.
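For example, with any libpq-based client (placeholder host, database, and CA path):

```
psql "host=203.0.113.10 dbname=mastodon user=mastodon sslmode=verify-ca sslrootcert=/etc/ssl/certs/kosmos-ca.pem"
```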

Author
Owner

I'm not sure I follow. The point of verifying a hostname is that the IP can change. That's the problem with DNS.

Thus, the question is still open: why would we have to verify the cert on the client in this scenario? It is a simple local config. And it is only there to encrypt the connection, so that someone *outside of our machines* cannot record unencrypted traffic when it's flowing through a public network. Nothing else.

So the options for this use case are:

  1. Use a private network
  2. Encrypt the connection on a public network

The easiest way to achieve number 2 is with simple x509 certificates that we ourselves configure on the machines. No?

Owner

If we only want to encrypt the traffic and do not care about the root cert's validity, then there's no need to create an issue for ejabberd, or to wait for a `verify-ca` feature.

I think the PR is good to go, since right now Mastodon and ejabberd connect to PostgreSQL using localhost, so the ejabberd cookbook will only need to be updated to create a cluster with the new server

Owner

Replication is now running on centaurus. I had to fix the firewall rules to make it work: #167. The data dir for PostgreSQL is also encrypted using encfs: #166. Both are ready to merge

Author
Owner

Splendid!
