I would like to present a checklist that will be extremely helpful in the following cases:
- You have inherited a DBMS and need to take over its administration;
- You have to learn a new DBMS as quickly as possible: you have never worked with it, but it will soon be used in your project;
- You maintain a «zoo» of different DBMSs and want to standardize the approach to working with all of them;
- You need to audit a particular DBMS.
This article will be of interest not only to DBAs, but also to anyone responsible for DBMS support and maintenance, for example system administrators, DevOps engineers and SREs.
As the article turned out to be quite “heavy”, I describe two levels: the first contains the recommendations that will help you build an MVP; the second adds more detailed settings and best practices, which I hide in a spoiler. Some parts for «advanced users» are also placed in spoilers.
How to collect all the necessary information concerning a database you are interested in
If you have never worked with the DBMS that is about to be launched, the first step is very prosaic: research. Take the documentation, well-rated articles and videos, and study them carefully.
Reserve at least a month for this stage.
I will add one unobvious but very useful source of knowledge: automation repositories. I prefer Ansible Galaxy.
The rest of the article assumes that we automate all actions (I prefer Ansible and Terraform). Apart from a few small illustrative sketches, there is no production code here: we will look at the approach itself, where everything goes into automation from the start and Git serves as the primary source of information about the database.
Primary deployment
OS/k8s/docker/whatever settings
Before configuring anything, it is worth documenting the settings: record the reasons for choosing particular values, the dates of changes, references to the documentation, and the DBMS version.
The second step is to study the resources of the layer beneath the database:
- What scaling options it gives us: whether we can increase or decrease CPU/RAM/disk;
- Which file systems are available out of the box and which should be added in the Terraform configuration;
- What kind of network we will have between the nodes: flat or routed;
- If quorum maintenance tools such as etcd or ZooKeeper will be used, whether they can be placed on separate instances with separate disks in case of high DBMS load.
The third step is to determine the size of the dataset and the required performance (I will cover the scaling options in more detail in the part on DBMS tuning). For example, if the expected load is 1-10 transactions per second, it is not worth spending time on fine-grained database tuning.
On the other hand, we must allow for scalability and think about the configuration for future workloads. This may seem excessive in the early stages, but when the database reaches a certain size, you won’t have to reboot to apply new parameters.
It’s worth automating the configuration you’ve chosen right away. For some databases ready-made automation already exists, e.g. for MongoDB there is the mongodb_linux role in the community.mongodb collection on Ansible Galaxy.
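A minimal sketch of pulling and applying that role; the inventory group name is a placeholder.

```yaml
# requirements.yml -- pull the collection that ships the role
collections:
  - name: community.mongodb

# playbook.yml -- apply the OS-tuning role to the database hosts
- hosts: mongodb_servers
  become: true
  roles:
    - community.mongodb.mongodb_linux
```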
Here are some important system settings
1) Kernel
- Transparent Huge Pages: whether to use them and with which flags.
- Swappiness, usually 1 or 0.
- Resource limits (ulimit / systemd unit directives), such as LimitFSIZE, LimitCPU, LimitAS, LimitMEMLOCK, LimitNOFILE, LimitNPROC.
- Keeping the Linux kernel version up to date.
- Interprocess communication and shared memory settings (kernel IPC, shmem) if the DBMS implements access parallelism via Linux IPC (hello, PostgreSQL).
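As an illustration, a minimal Ansible sketch assuming a systemd-based distribution; the mydb.service unit name and the limit values are placeholders.

```yaml
- hosts: db_servers
  become: true
  tasks:
    - name: Disable Transparent Huge Pages until the next reboot
      ansible.builtin.shell: echo never > /sys/kernel/mm/transparent_hugepage/enabled
      changed_when: true

    - name: Set vm.swappiness persistently
      ansible.posix.sysctl:
        name: vm.swappiness
        value: "1"
        state: present

    - name: Create a drop-in directory for the DBMS unit
      ansible.builtin.file:
        path: /etc/systemd/system/mydb.service.d
        state: directory
        mode: "0755"

    - name: Raise per-service limits (run daemon-reload and restart the service afterwards)
      ansible.builtin.copy:
        dest: /etc/systemd/system/mydb.service.d/limits.conf
        content: |
          [Service]
          LimitNOFILE=64000
          LimitNPROC=64000
```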
2) Compute
- Platform-specific settings and parameters, e.g. NUMA policy, exposed CPU flags.
- CPU frequency scaling governor selection. It is usually pinned to performance mode to keep the system from dropping into power saving.
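A small sketch of pinning the governor; it assumes the cpupower utility is installed (the package name differs between distributions).

```yaml
- hosts: db_servers
  become: true
  tasks:
    - name: Pin the CPU frequency governor to performance
      ansible.builtin.command: cpupower frequency-set -g performance
      changed_when: true
```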
3) Memory
- vm.* kernel parameters (vm.overcommit_memory, vm.dirty_ratio, vm.dirty_background_ratio).
- Special tuning may be required depending on the type of memory allocation by the DBMS engine.
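A minimal sketch of applying such parameters; the values are examples only, consult the documentation of your DBMS.

```yaml
- hosts: db_servers
  become: true
  tasks:
    - name: Apply memory-related kernel parameters (example values)
      ansible.posix.sysctl:
        name: "{{ item.name }}"
        value: "{{ item.value }}"
        state: present
      loop:
        - { name: vm.overcommit_memory, value: "2" }
        - { name: vm.dirty_ratio, value: "10" }
        - { name: vm.dirty_background_ratio, value: "5" }
```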
4) Disk. It is one of the most expensive and slowest resources for a DBMS, which is why its tuning deserves special attention:
- Selecting the file system and its parameters.
- Usually the documentation either explicitly recommends one file system or offers two or three as equivalent.
- For example, XFS is very fast with very large files but cannot be shrunk, while ext4 can be, although it handles large files worse.
- The choice of I/O scheduler depends on the type of disks used (NVMe SSD/SSD/HDD). Usually the choice is between noop (none) and deadline (mq-deadline).
- Choice of RAID configuration. RAID10 is preferable, but if it is too expensive, RAID5 or RAID6 can be acceptable in some cases.
- Swap. This article focuses on clustered DBMS deployments. If you have no replicas, swap will let you survive a memory shortage a little longer at the cost of performance, but in normal operation it is almost always unnecessary. If we can afford to lose a small portion of data when the service is killed by the out-of-memory killer (OOM) in exchange for performance and uptime, we will almost always configure the system that way.
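A sketch of the disk part, assuming SSD/NVMe data disks; the device name, mount point and scheduler choice are placeholders to adapt.

```yaml
- hosts: db_servers
  become: true
  tasks:
    - name: Set the I/O scheduler for data disks via a udev rule
      ansible.builtin.copy:
        dest: /etc/udev/rules.d/60-io-scheduler.rules
        content: |
          ACTION=="add|change", KERNEL=="sd[a-z]|nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"

    - name: Mount the data volume as XFS with noatime
      ansible.posix.mount:
        path: /var/lib/mydb
        src: /dev/sdb1
        fstype: xfs
        opts: noatime
        state: mounted
```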
5) Network
- Firewall. The conntrack module used by firewalld, iptables and other software firewalls can significantly reduce the performance of a DBMS server. If external solutions are available, prefer them.
- List of used ports. From the very beginning, keep a separate list of the ports used by the DBMS ecosystem. Then, when network interactions are hardened, there is less chance that the network engineers will break the DBMS.
- If you plan to use IPv6, prepare the DBMS service configuration in advance and learn the nuances of setting it up.
- net.* kernel settings for a large number of open sockets and other parameters, if the database requires them (for example, net.core.somaxconn, net.ipv4.tcp_fin_timeout, net.ipv4.tcp_keepalive_intvl, net.ipv4.tcp_keepalive_time, net.ipv4.tcp_max_syn_backlog).
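A sketch of the network part: the port list lives in one variable from the start; the values and ports are examples only, and the firewalld task can be skipped if you rely on an external firewall.

```yaml
- hosts: db_servers
  become: true
  vars:
    dbms_ports: [5432, 8008, 2379]   # example: PostgreSQL, Patroni API, etcd
  tasks:
    - name: Apply a network kernel parameter (example value)
      ansible.posix.sysctl:
        name: net.core.somaxconn
        value: "4096"
        state: present

    - name: Open the DBMS ecosystem ports
      ansible.posix.firewalld:
        port: "{{ item }}/tcp"
        permanent: true
        immediate: true
        state: enabled
      loop: "{{ dbms_ports }}"
```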
6) Shared resources
- NTP. For databases, it is very important, and in sharded variants even critical, to have properly configured time servers.
- DNS. Some databases rely on FQDNs for inter-server communication. It is useful to additionally populate /etc/hosts so as not to depend on DNS failures or on how fast the DNS servers are. As an extra measure, you can set the immutable flag on this file after configuring it.
- Repositories. From the very beginning, keep a separate list of the packages needed to keep the DBMS running. If you don’t have your own repository server now, it may appear later; and if you already do, you will be able to quickly restore the necessary repositories if it crashes.
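A small sketch of these shared-resource measures; the addresses, hostnames and the chrony service name are placeholders and distro-dependent.

```yaml
- hosts: db_servers
  become: true
  tasks:
    - name: Make sure time synchronisation is running
      ansible.builtin.service:
        name: chronyd
        state: started
        enabled: true

    - name: Pin cluster members in /etc/hosts
      ansible.builtin.blockinfile:
        path: /etc/hosts
        block: |
          10.0.0.11 db-node-1
          10.0.0.12 db-node-2
          10.0.0.13 db-node-3

    - name: Optionally make /etc/hosts immutable (remove with "chattr -i" before the next change)
      ansible.builtin.command: chattr +i /etc/hosts
      changed_when: true
```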
DBMS configuration
In almost every DBMS configuration guide you will see that there is no universal recipe for the initial configuration, only recommendations on how best to configure it for a particular type of load and database size. As I mentioned above, disks are the slowest and most expensive resource, which is why databases can be divided into three types by disk and memory usage:
- All data from a table / collection / other entities fits into RAM.
- The largest index fits into RAM.
- Even the indexes are so large that they don’t fit into RAM.
Based on these postulates, I would prepare three configuration profiles in which numerical values are scaled automatically using coefficients that depend on the profile. For example, you could write a shared_buffers calculator for PostgreSQL: depending on the size of the instance, the calculation technique for datasets of a few gigabytes and for terabytes would differ. This covers 90% of the needs; it remains to add a fourth, custom profile for special loads.
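A hypothetical sketch of such a calculator for PostgreSQL's shared_buffers, using the common 25%-of-RAM rule of thumb with a cap for very large hosts; the coefficients are purely illustrative.

```yaml
- hosts: postgres_servers
  vars:
    shared_buffers_ratio: 0.25     # classic "25% of RAM" rule of thumb
    shared_buffers_cap_mb: 16384   # illustrative cap for very large hosts
  tasks:
    - name: Calculate shared_buffers from the host's RAM
      ansible.builtin.set_fact:
        shared_buffers_mb: "{{ [(ansible_memtotal_mb * shared_buffers_ratio) | int, shared_buffers_cap_mb] | min }}"

    - name: Show the resulting value
      ansible.builtin.debug:
        msg: "shared_buffers = {{ shared_buffers_mb }}MB"
```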
Modifications and releases
While designing the automation, keep in mind that the configuration will change over time, so think from the start about mechanisms for:
- RBAC and account automation;
- modification and application of settings;
- tests after applying the settings (integration, load, functional), benchmarking.
Ideally, making changes should cause no downtime for users, and DBMS clients should not notice anything at all.
Before making changes, think through a rollback plan for each step.
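A minimal rolling-change sketch in Ansible: one node at a time, restart, then wait for the port to come back; the service name, config paths and port are placeholders.

```yaml
- hosts: db_servers
  become: true
  serial: 1
  tasks:
    - name: Apply the new configuration
      ansible.builtin.template:
        src: mydb.conf.j2
        dest: /etc/mydb/mydb.conf
      notify: restart mydb

    # Restart now, on this node only, before moving to the next host
    - ansible.builtin.meta: flush_handlers

    - name: Wait until the node accepts connections again
      ansible.builtin.wait_for:
        port: 5432
        timeout: 60

  handlers:
    - name: restart mydb
      ansible.builtin.service:
        name: mydb
        state: restarted
```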
Clustering
MVP
In the minimum variant, write automation for deploying the database cluster (or take a ready-made one), adapt the variables to your infrastructure and move on to the next point. Everything under the hood goes into the backlog and gets implemented along the way.
Advanced level
Before bringing the DBMS into the stack, you should investigate the available replication and load switching methods.
The following scenarios should be considered:
- routine manual load switching;
- automatic load switching after a single failure;
- behaviour during repeated network outages of non-deterministic duration;
- stopping the load switching automation (useful during maintenance work).
Here is a list of questions you should know how to answer:
- Whether synchronous replication is necessary or not (if it is supported by the DBMS at all).
- How replication and connectivity errors are handled. For example, when the connection to the client is lost after a transaction was executed and a commit was sent, but no confirmation that the commit succeeded reached the client.
- Is there a softer alternative to synchronous replication, like readPreference/writeConcern in MongoDB.
- What RPO is allowed per load switch, and how the replica to switch the load to is selected.
- Which clustering mechanism is to be used.
- Agree on using a MongoDB URI, a PostgreSQL multi-host URI or their analogues if they are available (see the connection-string sketch after this list); if not, switch a virtual IP address using keepalived or vip-manager. You can also put a proxy in front of the requests, not forgetting to cluster it as well; the proxy can be nginx/haproxy or a dedicated proxy for the particular DBMS.
- Think through unavailability scenarios: how will the clustering handle them? The hardest one, for example, is repeated network failures between nodes of arbitrary duration.
- What kind of system it is in CAP-theorem terms (CA/AP/CP).
- How backups will work after the load is switched.
- How monitoring will work after the load is switched (set up alerting for failover events).
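Here is the kind of multi-host connection URI mentioned above, kept as Ansible variables: with target_session_attrs=read-write libpq-based clients find the current primary themselves, and the MongoDB driver does the same given the replicaSet option. Hosts, users and database names are placeholders.

```yaml
db_connection_strings:
  postgresql: "postgresql://app@db-node-1:5432,db-node-2:5432,db-node-3:5432/appdb?target_session_attrs=read-write"
  mongodb: "mongodb://app@db-node-1:27017,db-node-2:27017,db-node-3:27017/appdb?replicaSet=rs0"
```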
Backup system
MVP
In the simplest case, we implement backups using the tools shipped with the DBMS. It is important to follow the documentation and, if possible, take backups from a replica, taking care of the mechanism for selecting the replica to back up from. If the documentation recommends taking the backup from the master node of the replication cluster, add a check that the node really is the master and take the copy from it. After the backup is taken, depending on the size of the database, you can apply compression while sending it to a remote server.
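A hedged PostgreSQL-flavoured sketch of that check: pg_is_in_recovery() returns true on a standby, so the compressed backup is taken only on a replica; paths and the inventory group are placeholders.

```yaml
- hosts: postgres_servers
  become: true
  become_user: postgres
  tasks:
    - name: Check whether this node is a replica
      ansible.builtin.command: psql -tAc "SELECT pg_is_in_recovery();"
      register: in_recovery
      changed_when: false

    - name: Take a compressed base backup from the replica
      ansible.builtin.command: >
        pg_basebackup -D /backups/{{ ansible_date_time.date }} -Ft -z -P
      when: in_recovery.stdout == 't'
```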
Advanced level
Gather from the business all the requirements on RPO (how much data may be lost) and RTO (how long recovery may take): how long is the company prepared to live without the business function?
Also reserve some time for the case when restoring from the backup goes wrong.
Based on these metrics, we build the backup system. Typically these are utilities that take a backup in logical or physical form, with or without support for point-in-time recovery.
RTO:
If the database is large, in the most demanding option one or more copies can be stored locally on a node with the same resource characteristics as the master node of the replication cluster. Next come fast S3 and other backup storage options.
RPO:
Good RPO figures can be achieved even in extreme cases, for example with Postgres, as described in the “lossless CUC” talk.
Also, if data is inserted from Kafka, you can agree with its administrators that Kafka is backed up and stores data for a long time, and get approval from the customers that point-in-time recovery will be done as a database backup plus re-inserting the data from Kafka. In that case a backup once a day may be enough.
The backup system itself should also be monitored and included in resource consumption planning.
It’s important to think about automated restore scenarios for when you need to:
- Restore a single node in a cluster. There are typically three options: re-sync the data from a neighbouring replica, take a backup from a neighbouring node and restore it onto the damaged node, or restore from an existing backup.
- Restore all nodes in the cluster on top of the same cluster. This is the most painful case: for example, a developer accidentally corrupted or deleted data. It can happen when, say, DELETE was run without a condition in PostgreSQL, or $out was used instead of $merge in a MongoDB aggregation pipeline.
- Restore the data to a test environment with obfuscation of sensitive data.
Monitoring systems
MVP
The monitoring system should include:
- System metrics; the de facto standard here is Prometheus Node Exporter. You can use other monitoring that captures essentially the same resource consumption metrics: CPU, RAM, network, disk.
- Query analytics.
- Collection of error logs and logs of heavy queries.
Standard monitoring and alerting of a virtual machine can be implemented with the Prometheus/VictoriaMetrics + Alertmanager stack, or with Zabbix.
Logging can be handled with the ELK stack or plain old journald. Besides, each database has its own solutions for log processing and query analytics.
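For instance, a minimal Prometheus sketch of the above: a scrape job for node_exporter on the database hosts plus an alert when a node stops reporting. Hostnames, job names and thresholds are placeholders.

```yaml
# prometheus.yml fragment -- scrape node_exporter on the database hosts
scrape_configs:
  - job_name: db_node
    static_configs:
      - targets: ['db-node-1:9100', 'db-node-2:9100', 'db-node-3:9100']

# rules file fragment -- alert when a node stops reporting
groups:
  - name: db-basic
    rules:
      - alert: DBNodeDown
        expr: up{job="db_node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "node_exporter on {{ $labels.instance }} is down"
```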
Advanced level
You can also use Percona Monitoring and Management for MySQL, MongoDB and PostgreSQL, which collects metrics and query analytics.
For PostgreSQL there are pgwatch2 and pgBadger. Just in case, I’ll state the obvious: logs should be stored on remote servers.
It is important to collect diagnostics from the server while it is in its death throes, before it finally goes down. Two minutes of data gathering will yield a lot of useful material for the postmortem, so it is good practice to pick tools for gathering such metrics before taking the DBMS into the stack. For example, in addition to system traces, PostgreSQL can be inspected with pgcenter and MySQL with pt-stalk.
It is worthwhile to think over a similar toolkit for all the infrastructure elements around the DBMS. We will go through it closer to the end of the article.
Security
I would outline several levels of security. Each of them has its own list of measures:
1. Minimum acceptable level
- Authentication uses up-to-date algorithms: today that means SCRAM-SHA-256 and stronger. Key-based or AD/Kerberos authentication is also an option (a PostgreSQL-flavoured sketch follows after the list of levels).
- The role model described in the RBAC section below is configured and used.
- Access restriction by IP addresses both at the DBMS level and at the firewall level, separate subnets.
- Vulnerable data is obfuscated when copying it to the test environments.
- Security-related monitoring is configured. In the basic implementation this means enabling the auditd daemon and processing its logs, plus DBMS settings or external tools for auditing user activity.
- Unused DBMS functionality, such as network port integration, is disabled.
- Regular review of security patches and application of updates.
- Repository caching.
2. Medium level
- Client-server communication uses SSL/TLS of current versions.
- Inter-server communication is encrypted with SSL/TLS.
- Application of multi-factor authentication.
3. High level
- Verification of and compliance with the requirements of Federal Law 152-FZ, PCI DSS, GDPR and other regulations.
- Disk subsystem encryption mechanisms are used.
- Backups are encrypted, encryption keys are stored separately from the backups. Data in the database is encrypted with the client key. Removing the key renders the data unusable, even on backups.
- RBAC is configured in such a way that the database infrastructure administrator has no access to the data itself.
- Configured row level security, column level security.
- Query auditing and extended logging are enabled, and the same requirements are applied to the logs as to the data in the database.
- The DBMS is isolated both physically and in a network segment cut off from the Internet; maintenance is performed from authorized workstations only.
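As promised above, a PostgreSQL-flavoured sketch of two minimum-level measures: enforcing SCRAM-SHA-256 and restricting access by subnet via pg_hba. The pg_hba.conf path, database, user and subnet are placeholders.

```yaml
- hosts: postgres_servers
  become: true
  become_user: postgres
  tasks:
    - name: Require scram-sha-256 password hashing
      community.postgresql.postgresql_set:
        name: password_encryption
        value: scram-sha-256

    - name: Allow connections from the application subnet only
      community.postgresql.postgresql_pg_hba:
        dest: /etc/postgresql/16/main/pg_hba.conf
        contype: host
        databases: appdb
        users: app_rw
        source: 10.0.1.0/24
        method: scram-sha-256
```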
RBAC and automation of accounts
MVP
It’s important to lay down a clear role model for data access right away. Here are the load profiles and the roles I prepare for them:
- username_dbadm – for implementing database changes (DDL for releases);
- username_rw – for service operation;
- username_ro – for analysts and for read-only “just looking” access.
It is important to limit the scope of privileges, roles and accounts to a single database within the DBMS. That is, if a user has the username_dbadm role, it must apply to a specific database, not to everything inside the DBMS.
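A hedged sketch of this role model for PostgreSQL; the role names, target database and the exact privilege set are illustrative, and password handling is omitted here.

```yaml
- hosts: postgres_servers
  become: true
  become_user: postgres
  tasks:
    - name: Create the roles
      community.postgresql.postgresql_user:
        name: "{{ item }}"
      loop: [app_dbadm, app_rw, app_ro]

    - name: Give the read-only role SELECT on all tables in schema public of appdb
      community.postgresql.postgresql_privs:
        database: appdb
        roles: app_ro
        objs: ALL_IN_SCHEMA
        schema: public
        type: table
        privs: SELECT
```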
Advanced level
We should also remember IP-address access restrictions, so that a developer, even knowing the service DBMS password, cannot log in with it from their workstation.
For example, the standard Ansible role from the community MongoDB collection does not support this functionality, although MongoDB itself does. In PostgreSQL, it’s hard to automate adding records to pg_hba via Patroni. For ClickHouse there is little ready-made Ansible automation at all, especially for RBAC, which appeared in that DBMS only a couple of years ago.
Some databases have the ability to logically organise the contents of the database into different namespaces using schemas (database schema). If this entity is available in your DBMS, you should use it.
Secrets are best generated automatically and published to a password vault, or handled via AD/Kerberos.
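A minimal sketch of generating such a secret on the fly instead of hard-coding it; the account name is a placeholder, and publishing the secret to your password vault is left to whatever secrets tooling you use.

```yaml
- hosts: postgres_servers
  become: true
  become_user: postgres
  vars:
    # /dev/null means the generated password is not written to the controller's disk
    app_rw_password: "{{ lookup('ansible.builtin.password', '/dev/null length=24 chars=ascii_letters,digits') }}"
  tasks:
    - name: Set the generated password for the service account
      community.postgresql.postgresql_user:
        name: app_rw
        password: "{{ app_rw_password }}"
```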
How to ensure data durability
I have deliberately separated this point, as durability is one of the fundamental properties of databases. In general terms, several questions need to be answered:
- How is journaling performed? Does the database write changes to a log before applying them? Most often this is write-ahead logging; sometimes it comes in the form of redo log + undo log. The binary log in MySQL and the oplog in MongoDB deserve separate mention.
- Does the database have a checkpointing mechanism? Is snapshotting used?
- Are checksums calculated at different stages of writing data to disc?
- How is replication configured? Do we need synchronous mode? Multimaster? (This point has already been covered in some detail.)
- Features of data management engines in terms of durability (MySQL InnoDB, MongoDB WiredTiger).
- Are transactional DDL statements supported?
- Consensus and quorum maintenance algorithms, if there are any.
- Whether sharding is supported. If so, study how durability is maintained in the sharded variant of the DBMS.
- How are data versioning and expiration handled?
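As one concrete illustration, for PostgreSQL the answers to several of these questions map to parameters that are worth pinning explicitly in your automation; the values below are the conservative defaults, not tuning advice.

```yaml
postgresql_durability_settings:
  wal_level: replica
  fsync: "on"
  synchronous_commit: "on"
  full_page_writes: "on"
```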
The ecosystem around the DBMS
For each element of the ecosystem, we need to go through the same things we did to prepare the DBMS itself. This will include:
- Backup tools that come separately from the DBMS
- Monitoring and alerting tools
- Query proxying and multiplexing systems
- Log analysers
- Quorum maintenance tools (etcd, ZooKeeper, consul, whatever)
- The huge world of everything else that can relate to your DBMS in any way, for example from https://mad.firstmark.com/
The whole ecosystem needs to meet certain security requirements, be backed up and automated.