When it comes to storing very large volumes of time-series data, with several thousand rows added every second, one of the first questions that comes to mind is whether traditional databases such as MySQL, Postgres or SQLite can handle that data volume, or whether dedicated platforms and architectures such as Hadoop, MongoDB, ClickHouse or InfluxDB are needed.

The code associated with this post is available on GitHub: https://github.com/atsdb/db-perf-test

Introduction

Benchmark-relevant measurements

There are several aspects to answering this question. The first is the performance of inserting new data, but also the performance of reading back from the database to find specific data. If the assumption is that the incoming volume is on the order of several thousand inserts per second, the natural question is whether traditional databases like MySQL or Postgres can keep up. And if they can handle 1K rows per second (rps), what is the maximum speed at which they can ingest data: 10K, 50K, 100K?

The second aspect is the size of the data stored by the database engine. For an application that collects just 1K rows per second, where one row is 20 bytes (5 integers of 4 bytes each), how long would it take to fill a 200GB disk? Assuming the database stores the data without any extra overhead, and roughly 180 GiB of usable space, that would take 180×1024³ / (1000×20×3600×24) ≈ 110 days. But in practice the overhead of the database storage engine can be quite fat, or even very fat... so the disk could fill up in a month or even less.
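
The same back-of-the-envelope arithmetic, as a small Go sketch (assuming ~180 GiB of usable space and zero storage overhead):

// Back-of-the-envelope check of the fill-time estimate above,
// assuming ~180 GiB of usable space and zero storage overhead.
package main

import "fmt"

func main() {
	const (
		rowsPerSecond = 1000.0
		bytesPerRow   = 20.0 // 5 integers of 4 bytes each
		usableBytes   = 180.0 * 1024 * 1024 * 1024
	)
	days := usableBytes / (rowsPerSecond * bytesPerRow * 3600 * 24)
	fmt.Printf("%.0f days to fill the disk\n", days) // prints: 112 days (close to the 110 days above)
}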

The third aspect, less important but still critical for assessing the various databases, is the performance of the database server in terms of CPU and memory usage. For instance, for a database able to ingest 10K rows/sec, how much CPU would it use: 10%, 50% or 100%? In the rest of this article, we will look at CPU usage, CPU load and system memory usage.

Performing the test using a NoSQL schema

There can be a lot of buzz around what NoSQL means, so here is what it refers to in this article: the use of database tables without any kind of join, where one table is used to store as many rows as possible (up to 100 million or more).

There are 3 kinds of table schema used for the test:

  • Slim table: contains 2 columns of type integer (4 bytes), without any index.
  • Light table with key and index: contains two more columns compared to the slim table; one is used as an auto-increment primary key, and the other one as a non-primary index. This is useful to test the index performance of each database engine.
  • Large table with key: contains 10 columns; the first is the primary key, the second is a variable-length string of up to 200 characters, and the last 8 are integers. This table is useful to assess the performance of a fat table.

Example for the first slim table without index:

CREATE TABLE `test-slim-table` (
  timestamp int(11) DEFAULT NULL,
  value int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Example for the second light table with index:

CREATE TABLE `test-light-table-index` (
  idx int(10) unsigned NOT NULL AUTO_INCREMENT,
  timestamp int(11) DEFAULT NULL,
  val1 smallint(6) DEFAULT NULL,
  val2 smallint(6) DEFAULT NULL,
  PRIMARY KEY (idx),
  KEY val1 (val1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Testing strategy and scenarios

In order to maximize the speed at which the database can ingest rows, three different insert strategies are defined:

Naive inserts

The naive insert works by first preparing an insert query, and then calling exec on the prepared query as fast as possible, using random data for each column. The insert query looks like:

insert into `test-light-table-index` (idx,timestamp,val1,val2) values (0,4384932083409,87,39804)

Using Go, the naive insert is implemented using these functions:

func (db *DB) Prepare(query string) (*Stmt, error)
func (db *DB) Exec(query string, args ...interface{}) (Result, error)
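
As a rough illustration (not the exact code from the repository; the driver, DSN, table and row values are placeholders), a naive insert loop in Go could look like this:

// Minimal sketch of the naive insert strategy: one prepared statement, then
// one Exec per row, so the engine has to honor its ACID guarantees for every
// single insert. Error handling is kept minimal for brevity.
package main

import (
	"database/sql"
	"log"
	"math/rand"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/bench")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmt, err := db.Prepare("INSERT INTO `test-light-table-index` (timestamp, val1, val2) VALUES (?, ?, ?)")
	if err != nil {
		log.Fatal(err)
	}
	defer stmt.Close()

	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		// Random data for each column; one round-trip (and full ACID overhead) per row.
		if _, err := stmt.Exec(time.Now().Unix(), rand.Intn(1<<15), rand.Intn(1<<15)); err != nil {
			log.Fatal(err)
		}
	}
}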

Transaction-based inserts

The drawback of the first, naive method is that the database has to keep its ACID promise for each inserted row, typically having to flush the data to disk and update the index on every exec. A better way is to use a transaction: rows are inserted within the transaction, which is then committed every second. The advantage is that the ACID promise only needs to be kept at commit time, which reduces the disk-flushing and index-maintenance overhead.

Using Go, the transaction can be handled easily using these two functions:

func (db *DB) Begin() (*Tx, error)
func (tx *Tx) Commit() error

Note that the insert statement needs to be prepared again after each new transaction is created.
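
A minimal sketch of this transaction-based loop, assuming the same imports as the naive sketch above plus "fmt" (the helper name and the one-second commit interval are illustrative):

// Sketch of the transaction-based strategy: rows accumulate in a transaction
// that is committed about once per second, so the flush/index cost is only
// paid at commit time. Error handling is kept minimal for brevity.
func insertWithTransactions(db *sql.DB, table string, duration time.Duration) error {
	query := fmt.Sprintf("INSERT INTO `%s` (timestamp, val1, val2) VALUES (?, ?, ?)", table)
	deadline := time.Now().Add(duration)
	for time.Now().Before(deadline) {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		// The statement must be prepared again for every new transaction.
		stmt, err := tx.Prepare(query)
		if err != nil {
			return err
		}
		commitAt := time.Now().Add(time.Second)
		for time.Now().Before(commitAt) {
			if _, err := stmt.Exec(time.Now().Unix(), rand.Intn(1<<15), rand.Intn(1<<15)); err != nil {
				return err
			}
		}
		stmt.Close()
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}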

Sharded inserts

The last strategy consists in splitting the incoming data across several schema-identical tables. How the data is split is not critical for this article; finding a well-balanced hashing scheme is business-specific. What matters here is whether the use of table shards can improve performance, and if so, what the optimal number of tables is.

For this scenario, the transaction-based insert is used on each table, and the test consists in running several transaction-based inserts in parallel (see the sketch below). Only the overall insert performance is considered, i.e. only the sum of inserts/second across all tables is reported.
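
A sketch of the sharded run, reusing the insertWithTransactions helper above and additionally importing "sync" (the per-shard table naming is purely illustrative):

// Sketch of the sharded strategy: the transaction-based insert above runs in
// parallel, one goroutine per schema-identical shard table. Only the summed
// throughput across all shards is reported.
func runShardedInserts(db *sql.DB, shards int, duration time.Duration) {
	var wg sync.WaitGroup
	for i := 0; i < shards; i++ {
		table := fmt.Sprintf("test-light-table-index-%d", i) // one table per shard
		wg.Add(1)
		go func(table string) {
			defer wg.Done()
			if err := insertWithTransactions(db, table, duration); err != nil {
				log.Println(table, err)
			}
		}(table)
	}
	wg.Wait()
}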

Preliminary results

Tests run for 10 minutes on an Azure A8 instance (28GB RAM). All numbers below are expressed in rows/second and represent the overall rate (i.e. the total number of inserted rows divided by the test duration).

Database configuration is off-the-shelf, except for the following MySQL / InnoDB tunings (a sample configuration excerpt follows the list):

  • innodb_buffer_pool_size: 11GB
  • innodb_file_per_table: 1
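
For reference, these two settings would typically be placed in the MySQL configuration file (my.cnf) roughly as follows; the file location and any other settings depend on the installation:

[mysqld]
innodb_buffer_pool_size = 11G
innodb_file_per_table   = 1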

Inserts per Second performance

Note that the MySQL/MyISAM transaction-insert result is the same as the naive-insert result, since MyISAM does not support transactions.

Table Schema: Slim

Database Engine    Naive insert    Transaction insert    Shard insert
MySQL MyISAM       18,104          18,104                58,897
MySQL InnoDB       731             9,107                 53,399
Postgres           2,120           10,147                63,481

Table Schema: Light with index

Database Engine    Naive insert    Transaction insert    Shard insert
MySQL MyISAM       12,317          12,317                26,026
MySQL InnoDB       660             6,815                 23,415
Postgres           1,607           3,845                 30,022

Table Schema: Large

Database Engine    Naive insert    Transaction insert    Shard insert
MySQL MyISAM       14,377          14,377                24,786
MySQL InnoDB       674             6,060                 17,350
Postgres           1,745           10,147                24,019

Data for MongoDB, as well as ClickHouse, SQLite and InfluxDB, will be added in a later post.

Disk usage performance

Numbers below are expressed as the average size in bytes per row (including data and index).

Database Engine    Slim    Light w. Index    Large
MyISAM             9       37                244
InnoDB             37      62                328
Postgres           36      44                318

MyISAM is outstanding compared to the others: it stores the data in an optimal way, and yet performs better in terms of writes/second.