postgresql, olap

Dealing with dynamic dimensions for metrics data / analytics


We are implementing a feature which allows users to store metrics, with arbitrary dimensions on it. The dimensions, although arbitrary, are used across the whole system.

So we made two tables to store dimensions:

class Dimension():
        """ Stores the `type` of the dimension """
        __tablename__ = 'dimensions'
        id: str
        title: str
        description: str

class DimensionValue():
        """ Stores a value of a dimension; each value belongs to a dimension `type` """
        __tablename__ = 'dimension_values'
        id: str
        title: str
        description: str
        type_id: str = ForeignKey('dimensions.id')

And then we have something like:

regions = Dimension(id=1, title="Region", description="Regional Activities")
department = Dimension(id=2, title="Departments", description="Activities by department")

us = DimensionValue(title="US", description="US Activities", type_id=1)
europe = DimensionValue(title="Europe", description="Europe Activities", type_id=1)

marketing = DimensionValue(title="Marketing", description="Activities related to the marketing department", type_id=2)
technology = DimensionValue(title="Technology", description="Activities related to the tech department", type_id=2)

We haven’t seen this approach very often; looking at benchmarks we usually see predefined dimensions, but the extra flexibility is important to us. Now, we have metrics that go something like this:

class Metric():
        """ Stores the metric metadata """
        __tablename__ = 'metrics'
        id: str
        title: str
        description: str
        unit: str
        periodicity: str

class MetricValues():
        """ Stores the individual data points of a metric """
        __tablename__ = 'metrics_values'
        id: str
        metric_id: str = ForeignKey('metrics.id')
        timestamp: str
        value: float

class MetricValuesDimensions():
        """ Links a metric value to its dimension values (N per value) """
        __tablename__ = 'metrics_values_dimensions'
        metric_value_id: str = ForeignKey('metrics_values.id')
        dimension_id: str = ForeignKey('dimension_values.id')

We are unsure about this approach, though. One metric value can have N dimensions, and uniqueness is only assured by a composite key (timestamp, metric_id and the set of dimensions) that can only be checked through a join.
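To make the uniqueness concern concrete, here is a rough sketch (assuming SQLAlchemy, which our pseudo-code loosely resembles; the DSN and names are placeholders) of how we would even have to go about detecting a duplicate (metric, timestamp, dimension-set), since the rule cannot be declared as a constraint:

# Rough illustration only: the "same metric, same timestamp, same set of dimensions"
# rule cannot be declared here, only checked after the fact with a query (or trigger).
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost/metrics")  # placeholder DSN

find_duplicates = text("""
    WITH tagged AS (
        SELECT mv.metric_id,
               mv.timestamp,
               array_agg(mvd.dimension_id ORDER BY mvd.dimension_id) AS dims
        FROM metrics_values mv
        JOIN metrics_values_dimensions mvd ON mvd.metric_value_id = mv.id
        GROUP BY mv.id, mv.metric_id, mv.timestamp
    )
    SELECT metric_id, timestamp, dims, count(*) AS copies
    FROM tagged
    GROUP BY metric_id, timestamp, dims
    HAVING count(*) > 1
""")

with engine.connect() as conn:
    for row in conn.execute(find_duplicates):
        print(row)  # each row is a (metric, timestamp, dimension-set) stored more than once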

Apart from that, the relationship happens at the “value” level, which seems off. Even though we don’t expect a huge volume (we would normally deal with daily/weekly/monthly data), there is a lot of duplicated information: imagine 150 data points (~3 years of weekly data); the dimension/value relationship would be replicated 150 times.

So we thought of adding another layer to metrics, called MetricSeries. One metric would have N series and one series N values; now we could move the dimension relationship to the series and have a lot less duplication (we don’t have to repeat the relationship for each value).
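Roughly what we have in mind, as a hedged SQLAlchemy-style sketch (names illustrative, not a final design): the dimension links move up to the series, so they are stored once per series instead of once per data point.

from sqlalchemy import Column, DateTime, Float, ForeignKey, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MetricSeries(Base):
    """One series per unique combination of dimension values for a metric."""
    __tablename__ = 'metrics_series'
    id = Column(String, primary_key=True)
    metric_id = Column(String, ForeignKey('metrics.id'), nullable=False)

class MetricSeriesDimension(Base):
    """The dimension combination, stored once per series instead of once per value."""
    __tablename__ = 'metrics_series_dimensions'
    series_id = Column(String, ForeignKey('metrics_series.id'), primary_key=True)
    dimension_value_id = Column(String, ForeignKey('dimension_values.id'), primary_key=True)

class MetricSeriesValue(Base):
    """The actual data points; ~150 weekly rows would reference one series row."""
    __tablename__ = 'metrics_series_values'
    id = Column(String, primary_key=True)
    series_id = Column(String, ForeignKey('metrics_series.id'), nullable=False)
    timestamp = Column(DateTime, nullable=False)
    value = Column(Float, nullable=False)
    __table_args__ = (UniqueConstraint('series_id', 'timestamp'),)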

Now this would make the “ETL” kind of a pain, I guess, since I would have to flatten the metrics table with an unknown number of dimensions, so it would turn into some complex recursive-style workflow. It starts to look too complex.
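To illustrate the flattening step we are worried about, a toy sketch with pandas and made-up data (not our actual ETL): it boils down to one pivot, but over a column set that is only known at runtime.

import pandas as pd

# Long format, e.g. straight out of the join: one row per (value, dimension) pair.
tags = pd.DataFrame({
    "metric_value_id": [1, 1, 2, 2],
    "dimension":       ["Region", "Departments", "Region", "Departments"],
    "dimension_value": ["US", "Marketing", "Europe", "Technology"],
})
facts = pd.DataFrame({
    "metric_value_id": [1, 2],
    "timestamp": ["2024-01-01", "2024-01-01"],
    "value": [10.0, 7.5],
})

# One pivot turns the unknown set of dimensions into ordinary columns;
# the resulting column list is only known at runtime, which is the part that worries us.
wide_tags = (tags.pivot(index="metric_value_id",
                        columns="dimension",
                        values="dimension_value")
                 .reset_index())
flat = facts.merge(wide_tags, on="metric_value_id")
print(flat)  # columns: metric_value_id, timestamp, value, Departments, Region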

The other approach we can think of (one that we don't really like) is to have a predefined set of dimension “slots”, say 5 dimensions, and then have user defined Dimension Titles. So following the example, something like:

class Dimension():
        """ Stores the mapping between a generic dimension slot and the user-defined title """
        __tablename__ = 'dimension'
        id: str
        title: str
        description: str

class DimensionValue():
        """ Stores the values of the dimensions """
        __tablename__ = 'dimension_values'
        id: str
        dimension_id: str = ForeignKey('dimension.id')
        value: str

class Metric():
        """ Stores the metric metadata """
        __tablename__ = 'metrics'
        id: str
        title: str
        description: str
        unit: str
        periodicity: str

class MetricSeries():
        """ Stores one series per combination of dimension values """
        __tablename__ = 'metrics_series'
        id: str
        metric_id: str = ForeignKey('metrics.id')
        dimension_1: str = ForeignKey('dimension_values.id')
        dimension_2: str = ForeignKey('dimension_values.id')
        dimension_3: str = ForeignKey('dimension_values.id')
        dimension_4: str = ForeignKey('dimension_values.id')
        dimension_5: str = ForeignKey('dimension_values.id')

class MetricSeriesValues():
        """ Stores the data points of a series """
        __tablename__ = 'metrics_series_values'
        id: str
        series_id: str = ForeignKey('metrics_series.id')
        timestamp: str
        value: float


This would make our life easier, since the cube is pretty much ready and we could query and filter straight against the series table. The problem we see in this case is that we would have plenty of NULLs in the MetricSeries table. On the other hand, it ensures uniqueness at the series level (which is flat with regard to dimensions) and makes OLAP pretty straightforward.
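For example (a hedged sketch assuming SQLAlchemy; the DSN and ids are placeholders), a pivot-ready result set would be a single flat query against the series table:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost/metrics")  # placeholder DSN

series_points = text("""
    SELECT v.timestamp,
           d1.value AS region,
           d2.value AS department,
           v.value
    FROM metrics_series s
    JOIN metrics_series_values v ON v.series_id = s.id
    LEFT JOIN dimension_values d1 ON d1.id = s.dimension_1
    LEFT JOIN dimension_values d2 ON d2.id = s.dimension_2
    WHERE s.metric_id = :metric_id
      AND d1.value = :region          -- filtering straight on a slot
""")

with engine.connect() as conn:
    rows = conn.execute(series_points, {"metric_id": "1", "region": "US"}).all()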

The obvious downside is how rigid the system becomes in terms of the number of dimensions. One is not expected to have hundreds of dimensions, so some restriction is OK, and we could easily do the same with 20 slots. Lots of NULLs, I guess, but again, it would make the analytical queries easier.

Lastly, we wonder if we should leverage the JSONB/ARRAY column types. I imagine people are not very fond of this approach, but we already ensure the schema and data types at the application level, so at least a couple of the downsides would be covered. We don’t know about performance, though.
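What we are imagining for the JSONB variant, as a hedged SQLAlchemy-style sketch (table and column names made up): the dimension combination lives in one jsonb column, a GIN index serves containment filters, and the schema stays enforced by the application as it is today.

from sqlalchemy import Column, DateTime, Float, Index, String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MetricPoint(Base):
    __tablename__ = 'metrics_points'
    id = Column(String, primary_key=True)
    metric_id = Column(String, nullable=False)
    timestamp = Column(DateTime, nullable=False)
    value = Column(Float, nullable=False)
    # e.g. {"region": "US", "department": "Marketing"}; shape is enforced by the application
    dimensions = Column(JSONB, nullable=False)

    __table_args__ = (
        # GIN index so jsonb containment filters (@>) don't scan the whole table
        Index('ix_metrics_points_dimensions', 'dimensions', postgresql_using='gin'),
    )

# Filtering uses jsonb containment, which the GIN index can serve, e.g.:
# session.query(MetricPoint).filter(MetricPoint.dimensions.contains({"region": "US"}))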

Some other relevant information: we expect the number of dimensions (and dimension values) to be fairly low; each tenant would have 10, maybe 15 dimensions, and each dimension around 10-15 values. The amount of data should also be low, as we are dealing with weekly/monthly/yearly data. We value flexibility to query (think of a pivot table for the end user) and query performance. Insert performance is less important, as inserts can be done asynchronously without much issue. We wouldn't mind some restrictions to begin with, but we want to avoid significantly costly design mistakes. This is a multi-tenant application, so the design has to take into account that every query will be filtered by tenant, no matter what. Any other detail I might be missing, I would be happy to edit in.
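On the tenant point, a small sketch of the indexing assumption we would bake in from day one (names illustrative): every table carries tenant_id and composite indexes lead with it.

from sqlalchemy import Column, DateTime, Float, Index, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class SeriesValue(Base):
    __tablename__ = 'metrics_series_values'
    id = Column(String, primary_key=True)
    tenant_id = Column(String, nullable=False)
    series_id = Column(String, nullable=False)
    timestamp = Column(DateTime, nullable=False)
    value = Column(Float, nullable=False)

    __table_args__ = (
        # tenant_id first, so the common "this tenant, this series, this period"
        # access path is a single index range scan
        Index('ix_series_values_tenant_series_ts', 'tenant_id', 'series_id', 'timestamp'),
    )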

Would love to have some feedback, we have next to zero experience with analytics platforms and such.

Thanks a lot


Solution

  • Now this would make the “ETL” kind of a pain I guess

    It will be that, I promise, and more. You've already identified why, in terms of a "recursive style workflow". I would put it in database terms: it defeats SQL's capability to apply relational algebra by hiding the relationships of the data from it.

    The dimensions, although arbitrary, are used across the whole system.

    I'm sure you mean that, but I doubt it. Some code -- a query, an application, a report -- must refer to those dimensions. Each one has some meaning and some domain. No dimension is automatically used unless some aspect of it is somehow implied. Proof: if I add the dimension "arbitration_time", how will the data be used anywhere before the term is defined?

    The fact that you anticipate queries using multiple dimensions says that they have meaning within the relational model. When you say, "the extra flexibility is important", what you're really saying is that you don't have your arms around the problem domain yet. You don't know what the dimensions are (or which ones you care about), so you want to add them willy-nilly as data, and deal with the fallout later.

    The right way to design a database is through iterations of data analysis. Each attribute is scrutinized for its proper location and definition. It's a high-skill endeavor, and depending on the complexity it can take months. Usually these days neither management nor the technical team has the stomach for that, and the DBA is relegated to catching whatever is tossed over the wall.

    The solution to not understanding the problem is often some kind of entity-attribute-value (EAV) "design". It's flexible, but you might as well not use a DBMS, because queries become a nightmare and the DBMS cannot enforce constraints.
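    To make "nightmare" concrete, a generic sketch (not your schema; every name here is invented): merely reassembling one logical row in an EAV layout costs one self-join per attribute, and every report repeats it.

from sqlalchemy import text

one_logical_row = text("""
    SELECT e.id,
           region.value      AS region,
           department.value  AS department,
           unit.value        AS unit
    FROM entities e
    LEFT JOIN attributes region     ON region.entity_id = e.id     AND region.name = 'region'
    LEFT JOIN attributes department ON department.entity_id = e.id AND department.name = 'department'
    LEFT JOIN attributes unit       ON unit.entity_id = e.id       AND unit.name = 'unit'
    -- ...and another join for every further attribute the report needs
""")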

    The au courant EAV grease in DBMSs today is JSON. JSON is just another way of tucking attributes into a database in a "flexible" way that defeats SQL's ability to enforce data consistency. Can your DBMS define constraints on a JSON attribute with precision equivalent to a column? If it can, why not just use a column? If it can't, why not just use a column?
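    To put that question in concrete PostgreSQL terms (hypothetical table and column names): yes, you can bolt a CHECK onto a jsonb attribute, but compare what it buys you with an ordinary column and a foreign key.

from sqlalchemy import text

# Hand-written policing of a jsonb attribute; the "foreign key" degenerates
# into a hard-coded list that must be kept in sync by hand.
constrain_the_json = text("""
    ALTER TABLE metrics_points
    ADD CONSTRAINT region_present_and_known CHECK (
        dimensions ? 'region'
        AND dimensions->>'region' IN ('US', 'Europe')
    )
""")

# The same rule as a typed column: NOT NULL, FK, and type checking come for free.
constrain_the_column = text("""
    ALTER TABLE metrics_points
    ADD COLUMN region_id text REFERENCES dimension_values(id)
""")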

    What to do? My advice: be strictly flexible. Iterate toward correctness.

    If you can't design the database ahead of time using prior knowledge, then design the database "in real time" using accruing knowledge. Name things what you think they should be named, and put them where you think they should go, using 3NF as a baseline. If something turns out to be in a key that shouldn't be, redefine the key. If something seems to be unique, make it so; if it turns out it's not, relax the constraint.
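    In practice that iteration is mundane DDL (a sketch; the table and constraint names are borrowed from your slot example purely for illustration): promote a suspected rule to a constraint, demote it if reality disagrees.

from sqlalchemy import text

tighten = text("""
    ALTER TABLE metrics_series
    ADD CONSTRAINT one_series_per_combo UNIQUE (metric_id, dimension_1, dimension_2)
""")
# Caveat: NULL slots compare as distinct under UNIQUE, so empty slots weaken this
# (newer PostgreSQL offers NULLS NOT DISTINCT for exactly that case).

relax = text("""
    ALTER TABLE metrics_series
    DROP CONSTRAINT one_series_per_combo
""")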

    Be generous with table and column creation. It's easier to consolidate duplicated tables into one than to disentangle commingled data that are sharing a column (overloading its meaning) because someone wasn't allowed to create an appropriate column. Some DBMSs let users create database objects in their own segregated namespace. That can be used as a sandbox for experimentation and later incorporation into the main, shared tables.
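    In PostgreSQL that segregated namespace is simply a schema (names invented): give the experimenter a schema of their own, and move the objects that earn their keep into the shared one.

from sqlalchemy import text

sandbox = text("CREATE SCHEMA IF NOT EXISTS analyst_sandbox AUTHORIZATION analyst")
promote = text("ALTER TABLE analyst_sandbox.candidate_dimension SET SCHEMA public")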

    Let someone be in charge of names, and agree on some conventions. No one is served if one table has acct and another Account_Number. Eschew abbreviations in database names, because they lack context. The whole English name is less ambiguous and easier to remember than which abbreviation the database happened to choose. (From your examples, that looks under control.)

    There aren't endless dimensions. It only seems that way now. Maybe there are 100, but 1000? Put things where they seem to belong, and adjust as knowledge improves. In that way, you iterate toward a correct model everyone can agree on.

    The process I'm describing involves some churn. A changing schema inevitably breaks queries and load processes. That's the price of not having an up-front design process.

    But the alternative is worse. I've seen the results of a full-fledged EAV model, implemented by a group that understood neither the relational model nor the problem domain they were working in. The result was an incomprehensible mountain of complex, unmaintainable code (database and application both) and a performance nightmare. On the upside, it was a full-employment contract for anyone who joined the team.