Set up Postgres replication #160

Closed
opened 2020-04-25 10:13:17 +00:00 by raucao · 20 comments
Owner

@greg already started working on this, but there was no issue for it yet

greg was assigned by raucao 2020-04-25 10:13:17 +00:00
raucao added a new dependency 2020-04-25 10:13:57 +00:00
Owner

As discussed in #147 it makes sense to first upgrade the current PostgreSQL server on andromeda to version 12, since the replication configuration has been improved (and simplified). In version 10 the replication settings had to be added to a config file in the data directory; since version 12 they live in `postgresql.conf` with the rest of the config.
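For reference, a minimal sketch of the version 12 layout (the address and user are placeholders):

```
# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 10

# postgresql.conf on the replica; with version 10 this had to live in
# recovery.conf inside the data directory
primary_conninfo = 'host=203.0.113.10 port=5432 user=replication'
```

On the replica, an empty `standby.signal` file in the data directory is what tells PostgreSQL 12 to run as a standby, replacing the old `recovery.conf` mechanism.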

I have followed this tutorial to set up replication between two VMs: https://www.percona.com/blog/2019/10/11/how-to-set-up-streaming-replication-in-postgresql-12/

The new setup will have PostgreSQL run with TLS enabled, with certs generated by Let's Encrypt, so that client connections and replication are secured.

In order to secure the servers, we will use a firewall rule that only allows connections from the IPs of the servers that need it. We will also use [client authentication](https://www.postgresql.org/docs/12/auth-pg-hba-conf.html) to limit access to specific databases for each server, and to allow the replica server access to the replication database.
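For illustration, such rules in `pg_hba.conf` could look roughly like this (databases, users, and addresses are placeholders):

```
# TYPE   DATABASE     USER         ADDRESS          METHOD
hostssl  mastodon     mastodon     203.0.113.20/32  md5
hostssl  ejabberd     ejabberd     203.0.113.21/32  md5
# "replication" is a keyword that matches streaming replication connections
hostssl  replication  replication  203.0.113.30/32  md5
```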

Now I need to remove some hardcoded values from the new recipes, using instead values from the Chef nodes living in the repo's `nodes` directory.

I think we can start with the upgrade to 12 on andromeda and switch to the new config, maybe on Tuesday? After this first step, which will require a short downtime, we can install PostgreSQL on the new Hetzner server and confirm replication works; this will not require downtime.
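For the upgrade step itself, assuming a Debian/Ubuntu install managed by postgresql-common (an assumption on my part), it should boil down to:

```
# migrates the old 10/main cluster to a new 12/main cluster
sudo pg_upgradecluster -v 12 10 main
```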

Owner

This took more work than I originally thought to get right, but I'm now satisfied with the Chef resource I wrote to set up a PostgreSQL primary or replica, with a TLS cert provided by Let's Encrypt.

Now I'm going to perform the PostgreSQL upgrade in a VM and make sure everything goes smoothly

Author
Owner

Why would we need Let's Encrypt tho? We don't want any random public clients to access it.

Owner

I ran into a Chef bug that appears to be caused by the implementation of custom resources. Notifications are broken, so adding an access rule using the `postgresql_access` resource does not reload the service: https://github.com/sous-chefs/postgresql/issues/648

The custom resource I wrote was really useful, since so much code is identical between a primary and a replica, but I might have to rewrite it another way (move the access rules to the recipe, for example). With this bug, the PostgreSQL service has to be reloaded manually after adding a new rule.

I created a repo to reproduce the issue; a maintainer will take a look at it.
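A sketch of that recipe-level workaround, so the notification actually fires (property names follow the sous-chefs cookbook; the address is a placeholder):

```ruby
# Declared in the recipe rather than inside the custom resource,
# working around sous-chefs/postgresql#648
postgresql_access 'replication' do
  access_type   'hostssl'
  access_db     'replication'
  access_user   'replication'
  access_addr   '203.0.113.30/32' # placeholder replica IP
  access_method 'md5'
  notifies :reload, 'service[postgresql]', :delayed
end

service 'postgresql' do
  supports reload: true
  action :nothing
end
```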

Owner

re: TLS, it's still useful between clients and servers even with firewall rules and access rules

Author
Owner

I didn't ask why we need TLS, but why we would need DNS and Let's Encrypt. It seems like you're proposing to use both in addition to TLS.

(Also, TLS is only needed when we use a public network, and adds overhead as well.)

Author
Owner

There is an open PR now, and the question why we need DNS and LE is still left unanswered by it. I think a normal X509 cert, based on a Kosmos root CA that all machines can use, would have been much easier to begin with, but also that it makes more sense overall.

With DNS, we have to change hostnames instead of IPs, which requires waiting for the cache to update. And LE isn't required at all when there are no public clients using DNS. Also, the cert is then tied to that hostname, while our own cert would be valid on any IP/machine.

Owner

TLS with Let's Encrypt is easier to set up than a self-signed cert on all machines. In my experience generating self-signed certs is a pain in the ass.

If the primary PostgreSQL server is, for example, andromeda.kosmos.org with TLS, changing it to another server in the client configs is the same amount of work as changing an IP (you change the FQDN instead).

Author
Owner

> TLS with Let's Encrypt is easier to set up than a self-signed cert on all machines. In my experience generating self-signed certs is a pain in the ass.

How is LE easier than running literally a single command on any POSIX system with OpenSSL installed? Storing it in a data bag is as easy.
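For instance, something along these lines (filenames and lifetime are arbitrary):

```
openssl req -x509 -newkey rsa:4096 -nodes -days 3650 \
  -subj "/CN=kosmos-postgresql" \
  -keyout server.key -out server.crt
```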

> If the primary PostgreSQL server is, for example, andromeda.kosmos.org with TLS, changing it to another server in the client configs is the same amount of work as changing an IP (you change the FQDN instead).

That completely disregards my criticism of using DNS here. And if you change the entire hostname, then it's obviously more work on both ends (plus the DNS entry itself) than simply changing the IP only. Not even speaking of naming and such in that case.

Author
Owner

... also not mentioning short expiration dates and frequent renewals, of course. None of that is necessary with a connection between private servers, where the only point is to encrypt the connection between the particular, internal machines.

Owner

I have pushed changes to my PR (#163) to switch over to a self-signed certificate. In VMs I got it to work, including with psql's [`verify-ca`](https://www.postgresql.org/docs/12/libpq-ssl.html) option.

In my experiments I could not get ejabberd's `sql_ssl_cafile` and [`sql_ssl_verify`](https://docs.ejabberd.im/admin/configuration/toplevel/#sql-ssl-verify) to ever work. At first I thought it was because my cert was a wildcard cert, but I got the same issue with a certificate generated for an example domain (`primary.kosmos.org` with an `/etc/hosts` entry) too.

I am always getting this error in ejabberd, so we will have to do without cert validation, or create an issue: `TLS client: In state certify at ssl_handshake.erl:1372 generated CLIENT ALERT: Fatal - Handshake Failure - {bad_cert,hostname_check_failed}`.

I also noticed that ejabberd does not support connecting to PostgreSQL servers with only TLS v1.3 enabled, so I have configured the server to allow a minimum of v1.2 (https://gitea.kosmos.org/kosmos/chef/src/commit/006377629734ed96019321f67961022f607ed1a5/site-cookbooks/kosmos-postgresql/resources/server.rb#L90)
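For reference, the relevant ejabberd options look roughly like this (host and paths are placeholders; option names are from the ejabberd docs linked above):

```yaml
# ejabberd.yml (sketch)
sql_type: pgsql
sql_server: "203.0.113.10"
sql_ssl: true
sql_ssl_cafile: "/etc/ssl/certs/kosmos-ca.pem"
# sql_ssl_verify: true   # the option that fails with hostname_check_failed
```

And the TLS floor on the PostgreSQL 12 side is the `ssl_min_protocol_version` setting in `postgresql.conf`:

```
ssl_min_protocol_version = 'TLSv1.2'
```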

Author
Owner

Sounds to me like this is all much more complicated than just setting up a private network to be honest.

But in any case, does that mean ejabberd comes with its own CA list/store, or why would it ignore a Kosmos CA cert from the system store (like when you just [use `update-ca-certificates`](https://unix.stackexchange.com/questions/90450/adding-a-self-signed-certificate-to-the-trusted-list/132163#132163) e.g.)?

Owner

In ejabberd you pass it a path to the CA root certs (https://docs.ejabberd.im/admin/configuration/toplevel/#sql-ssl-cafile). This part works; it's the hostname verification that fails.

Author
Owner

> it fails verifying the hostname

What hostname? Why would we use hostnames?

Author
Owner

... also, aside from not needing a hostname, why would we have to verify the cert on every connection? Someone who could replace the IP could also replace the local cert used.

Owner

Good implementations of PostgreSQL clients give you the option of verifying that the cert was signed by the root cert: `verify-ca`. For example, Gitea calls it `verify-ca` like the PostgreSQL documentation does: https://docs.gitea.io/en-us/config-cheat-sheet/#database-database

Since we decided to use IPs, we wouldn't want to check the hostname (or use it at all), but ejabberd has no option for `verify-ca`, so ejabberd wouldn't verify anything.

See https://www.postgresql.org/docs/12/libpq-ssl.html#LIBPQ-SSL-PROTECTION for the security implications of every option.
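For example, with any libpq-based client (placeholder host, database, and CA path):

```
psql "host=203.0.113.10 dbname=mastodon user=mastodon sslmode=verify-ca sslrootcert=/etc/ssl/certs/kosmos-ca.pem"
```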

Author
Owner

I'm not sure I follow. The point of verifying a hostname is that the IP can change. That's the problem with DNS.

Thus, the question is still open: why would we have to verify the cert on the client in this scenario? It is a simple local config. And it is only there to encrypt the connection, so that someone *outside of our machines* cannot record unencrypted traffic when it's flowing through a public network. Nothing else.

So the options for this use case are:

  1. Use a private network
  2. Encrypt the connection on a public network

The easiest way to achieve number 2 is with simple x509 certificates that we ourselves configure on the machines. No?

Owner

If we only want to encrypt the traffic and do not care about the root cert's validity, then there's no need to create an issue for ejabberd, or to wait for a `verify-ca` feature.

I think the PR is good to go, since right now Mastodon and ejabberd connect to PostgreSQL using localhost, so the ejabberd cookbook will only need to be updated to create a cluster with the new server

Owner

Replication is now running on centaurus. I had to fix the firewall rules to make it work: #167. The data dir for PostgreSQL is also encrypted using encfs: #166. Both are ready to merge

Author
Owner

Splendid!
