Data Model
Model
GreptimeDB uses the time-series table to guide the organization, compression, and expiration management of data. The data model is mainly based on the table model in relational databases while considering the characteristics of metrics, logs, and events data.
All data in GreptimeDB is organized into tables with names. Each data item in a table consists of three semantic types of columns: Tag
, Timestamp
, and Field
.
- Table names are often the same as the indicator names, log source names, or metric names.
Tag
columns uniquely identify the time-series. Rows with the sameTag
values belong to the same time-series. Some TSDBs may also call them labels.Timestamp
is the root of a metrics, logs, and events database. It represents the date and time when the data was generated. A table can only have one column with theTimestamp
semantic type, which is also called thetime index
.- The other columns are
Field
columns. Fields contain the data indicators or log contents that are collected. These fields are generally numerical values or string values, but may also be other types of data, such as geographic locations or timestamps.
A table clusters rows of the same time-series and sorts rows of the same time-series by Timestamp
.
The table can also deduplicate rows with the same Tag
and Timestamp
values, depending on the requirements of the application.
GreptimeDB stores and processes data by time-series.
Choosing the right schema is crucial for efficient data storage and retrieval; please refer to the schema design guide for more details.
Metrics
Suppose we have a table called system_metrics
that monitors the resource usage of machines in data centers:
CREATE TABLE IF NOT EXISTS system_metrics (
host STRING,
idc STRING,
cpu_util DOUBLE,
memory_util DOUBLE,
disk_util DOUBLE,
ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY(host, idc),
TIME INDEX(ts)
);
The data model for this table is as follows:
This is very similar to the table model everyone is familiar with. The difference lies in the TIME INDEX
constraint, which is used to specify the ts
column as the time index column of this table.
- The table name here is
system_metrics
. - The
PRIMARY KEY
constraint specifies theTag
columns of the table. Thehost
column represents the hostname of the collected standalone machine. Theidc
column shows the data center where the machine is located. - The
Timestamp
columnts
represents the time when the data is collected. - The
cpu_util
,memory_util
,disk_util
, andload
columns in theField
columns represent the CPU utilization, memory utilization, disk utilization, and load of the machine, respectively. These columns contain the actual data and are not indexed, but they can be efficiently computed and evaluated, such as for the latest value, maximum/minimum value, average, percentage, and so on. - The table sorts and deduplicates rows by
host
,idc
,ts
.
Events
Another example is creating a table for events like access logs:
CREATE TABLE access_logs (
access_time TIMESTAMP TIME INDEX,
remote_addr STRING,
http_status STRING,
http_method STRING,
http_refer STRING,
user_agent STRING,
request STRING,
) with ('append_mode'='true');
- The time index column is
access_time
. - There are no tags.
http_status
,http_method
,remote_addr
,http_refer
,user_agent
andrequest
are fields.- The table is an append-only table for storing logs that do not need deduplication.
- The table sorts rows by
access_time
.
To learn how to indicate Tag
, Timestamp
, and Field
columns, please refer to table management and CREATE statement.
Design Considerations
GreptimeDB is designed on top of the table model for the following reasons:
- The table model has a broad group of users and it's easy to learn; we have simply introduced the concept of time index to metrics, logs, and events.
- Schema is metadata that describes data characteristics, and it's more convenient for users to manage and maintain.
- Schema brings enormous benefits for optimizing storage and computing with its information like types, lengths, etc., on which we can conduct targeted optimizations.
- When we have the table model, it's natural for us to introduce SQL and use it to process association analysis and aggregation queries between various tables, reducing the learning and usage costs for users.
- We use a multi-value model where a row of data can have multiple field columns, instead of the single-value model adopted by OpenTSDB and Prometheus. The multi-value model is used to model data sources where a metric can have multiple values represented by fields. The advantage of the multi-value model is that it can write or read multiple values to the database at once, reducing transfer traffic and simplifying queries. In contrast, the single-value model requires splitting the data into multiple records. Read the blog for more detailed benefits of the multi-value mode.
GreptimeDB uses SQL to manage table schema. Please refer to table management for more information. However, our definition of schema is not mandatory and leans towards a schemaless approach, similar to MongoDB. For more details, see Automatic Schema Generation.