6 min read · Apr 7, 2023
Time-series data consists of sequences of measurements taken over a period of time, sharing common metadata. Time-series data is everywhere, and companies need to collect and analyse it to understand what is happening in their businesses right now and to promptly assess future needs. It is commonly used in fields such as finance, economics, meteorology, energy, and IoT.
Before version 5.0 of MongoDB, there were two main options for storing time-series data in MongoDB.
- Store each individual data point as a separate document in a regular MongoDB collection.
- Use the Bucket pattern to group data points into buckets based on time or a fixed size.
These approaches, however, had a number of limitations, including:
- Much higher data and index storage compared to dedicated time-series data stores.
- Slower data access.
- Overhead of managing buckets.
Native time-series support has been available in MongoDB since version 5.0, and this post explores its benefits, limitations, and best practices.
For many years people have used MongoDB to store their time-series data. Many have come to understand that you don’t simply store time-series data “as-is” directly into regular MongoDB collections. This method often results in an excess of storage and processing overhead, unnecessarily large index structures, and typically suboptimal performance.
In the past, the best approach to efficiently store time-series data in MongoDB was to implement the Bucket pattern. The main objective of this pattern was to store several measurements which logically belong together into a single document. Since indefinitely growing a single document and its associated bucket is not feasible, the application logic had to ensure the creation of a new bucket once specific thresholds were reached.
Although this method was effective, it required significant investment in upfront schema design and imposed a greater burden on developers who were responsible for implementing and fine-tuning the bucketing logic. In addition, some types of queries against the bucketed data required extra effort due to the way the data was being stored.
Since version 5.0, MongoDB has provided native time-series support, so developers now get the benefits of the Bucket pattern without the development overhead. Time-series data can be inserted and queried in the same way as regular collections, which makes it faster, easier, and less expensive to work with time-series workloads.
The following characteristics are typical of time-series workloads:
- Inserts arrive in batches that are sequentially ordered by time.
- Data sets are typically append-only.
- Queries and aggregations are typically applied within a specified time range.
MongoDB’s time-series collections take advantage of these characteristics by organizing writes in a way that data from the same source is stored together in the same bucket with other data points from a similar time, which optimizes index and memory usage.
MongoDB time-series collections are composed of three components:
- Time: when the data point was recorded.
- Metadata: the unique identifier for a series that rarely changes.
- Measurements: data points tracked at increments in time.
Internally, the metadata is recorded once per time series, and the measurement fields are stored and accessed in a columnar fashion. Multiple buckets are automatically created for storing data based on the granularity.
Creating a Time-Series Collection
When a time-series collection is created, a number of time-series related options can be specified. Some of the options available are:
- timeField (required): the name of the field which contains the date in each time-series document (the value must be a valid BSON date).
- metaField (optional): the name of the field which contains metadata in each time-series document. The metadata in the specified field should be data used to label a unique series of documents, and should rarely, if ever, change.
- granularity (optional): possible values are "seconds" (the default), "minutes", or "hours".
- expireAfterSeconds (optional): the number of seconds after which documents expire and are automatically deleted by MongoDB.
For the full set of options please refer to the official MongoDB documentation here: Creating a Time-Series Collection
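As a minimal sketch, assuming a pymongo client connected to a MongoDB 5.0+ deployment (the database, collection, and field names here are purely illustrative):

```python
# Time-series options document; "timestamp" and "sensor" are illustrative names.
timeseries_options = {
    "timeField": "timestamp",   # required: each document must carry a BSON date here
    "metaField": "sensor",      # optional: labels a unique series, rarely changes
    "granularity": "minutes",   # optional: "seconds" (default), "minutes", or "hours"
}

# With a live pymongo client the collection would be created like this:
# client.weather_db.create_collection(
#     "readings",
#     timeseries=timeseries_options,
#     expireAfterSeconds=86400,  # optional TTL: remove data older than one day
# )
```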
Bucket Size
Time-series collections in MongoDB store data in buckets. Each bucket represents a group of documents with the same metaField, stored together as one document for a timespan. The time range of measurements in a bucket is influenced by the granularity that is set when the collection is created:
- seconds (covered timespan: 1 minute to 1 hour)
- minutes (covered timespan: 1 hour to 1 day)
- hours (covered timespan: 1 day to 30 days)
In addition to granularity, bucket size is controlled by two settings:
- timeseriesBucketMaxCount (default: 1000): the maximum number of measurements in a bucket.
- timeseriesBucketMaxSize (default: 125 kB): the maximum size of the measurements in a bucket.
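To make the count threshold concrete, here is a small Python sketch of the rollover rule the server applies internally. It is a simplification: real buckets also close on the granularity timespan and the size limit, and the 1000-measurement cap below merely mirrors timeseriesBucketMaxCount.

```python
def assign_buckets(measurements, max_count=1000):
    """Group an ordered series into buckets, closing each bucket at max_count."""
    buckets = []
    current = []
    for m in measurements:
        current.append(m)
        if len(current) >= max_count:  # bucket full: close it, start a new one
            buckets.append(current)
            current = []
    if current:                        # flush the trailing partial bucket
        buckets.append(current)
    return buckets

# 2500 measurements with a 1000-measurement cap yield buckets of 1000, 1000, 500.
sizes = [len(b) for b in assign_buckets(range(2500))]
```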
Storage Reduction
Massive storage reduction is achieved using a combination of column compression techniques and the use of “zstd” compression instead of the default compression algorithm “snappy”.
Storage is also improved because each bucket exposes a min/max timeField, and indexes use the bucket-level fields (instead of the timeField of the individual documents), which results in significantly smaller indexes.
Compression of almost 30 times can be achieved with time-series collections, compared to regular collections where compression is typically around 2 to 3 times.
Adding Secondary Indexes
To improve query performance further, secondary indexes can be added. The MongoDB recommendation is to create secondary compound indexes on the fields specified as the timeField and the metaField. If the value of the metaField is a document, you can create secondary indexes on fields inside that document.
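A hedged sketch of that recommendation in pymongo terms; the field names sensor and timestamp are assumptions, not prescribed names:

```python
# Compound index keys: metaField first, then timeField (1 = ascending).
index_spec = [("sensor", 1), ("timestamp", 1)]

# With a live collection handle this would be:
# db.readings.create_index(index_spec)

# If the metaField value is a subdocument, dotted paths reach inside it:
nested_index_spec = [("sensor.region", 1), ("timestamp", 1)]
```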
Reduced I/O for Read Operations
Internally, each bucket exposes the min/max timeField, and time-series collections have a system-generated clustered index on the bucket time. This orders the data on disk so that adjacent buckets can be stored in the same page, which reduces scan costs. Furthermore, indexes use the min/max timeField of the bucket (instead of the timeField of the individual documents), so indexes are much smaller and scans only examine buckets within the queried time ranges.
Preventing Duplicates
If there is a need to prevent duplicates, then because time-series collections support neither unique indexes nor upserts, a find operation would need to be performed with every insert. This would increase the server's CPU requirements, so it is not a recommended approach.
Instead, the best practice is to batch write data with insertMany(). Although this does not prevent duplicates, an aggregation pipeline can be used to filter them out, and the result can be presented as a view.
It is important to note that generating duplicate data in data ingestion pipelines should be avoided. Conducting duplication checks can be helpful in detecting issues with data ingestion processes.
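One way to express that filtering is a $group stage that keeps the first document per (metadata, time) pair. The pipeline below is a sketch with illustrative field names, followed by a pure-Python rendering of the same keep-first logic:

```python
# Aggregation pipeline sketch: one surviving document per (sensor, timestamp).
dedup_pipeline = [
    {"$group": {
        "_id": {"sensor": "$sensor", "ts": "$timestamp"},
        "doc": {"$first": "$$ROOT"},
    }},
    {"$replaceRoot": {"newRoot": "$doc"}},
]
# Exposed as a view with a live client (collection names are illustrative):
# db.create_collection("readings_dedup", viewOn="readings", pipeline=dedup_pipeline)

def dedupe(docs):
    """Pure-Python equivalent: keep the first doc per (sensor, timestamp)."""
    seen = {}
    for d in docs:
        seen.setdefault((d["sensor"], d["timestamp"]), d)
    return list(seen.values())

docs = [
    {"sensor": "a", "timestamp": 1, "v": 10},
    {"sensor": "a", "timestamp": 1, "v": 10},  # duplicate
    {"sensor": "a", "timestamp": 2, "v": 11},
]
unique = dedupe(docs)  # two documents remain
```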
When compared to regular collection storage, time-series collections offer enhanced query efficiency and reduce disk usage for both time-series data and secondary indexes. This is because time series collections employ a columnar storage format and organize data in chronological order using an automatically created clustered index.
The benefits of MongoDB time-series collections include:
- Increased developer productivity.
- Reduced complexity for working with time-series data.
- Reduced I/O for read operations.
- Massive reduction in storage and index size (almost 30 times compression).
Time-series collections have a number of limitations that include:
- No support for Change Streams.
- No support for Schema Validation Rules.
- No support for Transactions.
- No support for Unique Indexes (no way to prevent duplicates).
- Limited support for Updates and Deletes.
The limitations are largely a result of the time-series collections' emphasis on the common characteristics of time-series workloads; introducing these features for time-series collections would be complicated and would impede data ingestion.
For the full set of limitations please refer to the official MongoDB documentation here: Time-Series Collection Limitations.
Best Practices
- Disable retryable writes to maximize write throughput.
- Batch writes with insertMany(). Batches should contain multiple documents per series (as defined by the metaField).
- Choose a metaField that never or rarely changes and which uniquely identifies a series.
- Allocate more physical RAM to scale for high-cardinality use cases (e.g., more than a few thousand unique metaField combinations).
- Shard your collection to increase cardinality handling and throughput.
- Create secondary indexes (e.g., { metaField: 1, timestamp: 1 }).
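Putting the write-side advice together, a hedged pymongo sketch (names are illustrative; the connection lines assume a reachable deployment, so they are shown as comments):

```python
from datetime import datetime, timezone

# Build one batch containing multiple documents per series ("sensor" = metaField).
batch = [
    {
        "sensor": {"id": s},
        "timestamp": datetime(2023, 4, 7, 12, m, tzinfo=timezone.utc),
        "temp": 20 + m,
    }
    for s in ("a", "b")   # two series
    for m in range(3)     # three measurements each
]

# With retryable writes disabled for throughput, the insert would look like:
# client = pymongo.MongoClient("mongodb://localhost:27017", retryWrites=False)
# client.weather_db.readings.insert_many(batch, ordered=False)
```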