Normalizing User Data
Heya,
When you are data modeling for any application, you face a choice between normalized and denormalized data. A canonical example is a mailing address in an ecommerce site. You can either keep orders and mailing addresses separate, linked by a foreign key:
-- The address table is created first so Orders can reference it.
CREATE TABLE OrderAddress (
    ID INT PRIMARY KEY,
    OrderAddress VARCHAR(255),
    OrderCity VARCHAR(100),
    OrderCountry VARCHAR(100)
);

CREATE TABLE Orders (
    ID INT PRIMARY KEY,
    OrderAddressID INT,
    -- other order data
    CONSTRAINT FK_Address FOREIGN KEY (OrderAddressID) REFERENCES OrderAddress(ID)
);
This is the normalized option. Or, you can have each order include a mailing address, which is denormalized:
CREATE TABLE Orders (
    ID INT PRIMARY KEY,
    OrderAddress VARCHAR(255),
    OrderCity VARCHAR(100),
    OrderCountry VARCHAR(100)
    -- other order data
);
Each option has different strengths and tradeoffs.
Normalized Data
With a normalized model, you store less data. If a customer has 100 orders, you store the mailing address once and reference it from every order.
You also have less chance of incorrect data, because the address is stored in exactly one place. If an address needs to change, whether it was entered incorrectly or the customer moved, there’s only one location to update.
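For example, a correction against the schema above is a single statement (the specific values here are invented for illustration):
-- One UPDATE fixes the address for every order that references it.
UPDATE OrderAddress
SET OrderCity = 'Springfield'
WHERE ID = 42;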
However, you can run into an issue with address history. If someone orders five times from your ecommerce site, then moves and updates their address, you lose the previous address information. This can create issues if you want to run queries such as which cities have seen the largest year-on-year increase in orders.
There can also be performance issues (though you should always benchmark with your own data set before believing anything anyone ever says about performance). For instance, if you want to know how many orders are going to a given city to optimize shipping, you must join the Orders and OrderAddress tables. If each of these tables has many rows, this could be problematic.
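Here’s a sketch of that orders-per-city query against the schema above:
-- The normalized model needs a join to get city counts.
SELECT a.OrderCity, COUNT(*) AS OrderCount
FROM Orders o
JOIN OrderAddress a ON o.OrderAddressID = a.ID
GROUP BY a.OrderCity
ORDER BY OrderCount DESC;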
There are mitigations, such as caching, to help with these weaknesses. There are also different degrees of normalization: any time you are using a relational database such as PostgreSQL, SQL Server, or MySQL, you need to consider which normal form you want to apply to your data.
Denormalized Data
This option has strengths and weaknesses that are the opposite of normalized data’s.
When you include a full mailing address in every order, you have a running history of all addresses, so there’s no chance of losing addresses when someone moves. When running queries against order data, no joins are needed to retrieve city data, because it is present in every order.
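With the denormalized Orders table above, the same orders-per-city query needs no join:
-- City data lives on each order row, so no join is required.
SELECT OrderCity, COUNT(*) AS OrderCount
FROM Orders
GROUP BY OrderCity
ORDER BY OrderCount DESC;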
On the flip side, you’ll have more data to store. The address is repeated across every order. Sometimes this is a lot more data.
Depending on how you obtain the address, it may be incorrect or slightly different for each order. You can standardize formatting, but if the formatting logic changes, you still have data drift. If you need to fix an incorrect address on five orders, you have to update five rows, and the same is true for one hundred or one thousand orders.
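A sketch of that fix, assuming the bad value is at least consistent (if each order stored a slightly different variant, each variant would need its own statement):
-- Every affected order row must be touched; in the normalized
-- model this would have been a single-row update.
UPDATE Orders
SET OrderCity = 'Springfield'
WHERE OrderCity = 'Sprngfield';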
Denormalized data was common before the invention of the relational data model as well as during the NoSQL renaissance of the 2010s, and it remains common today in large scale systems. It also shows up in primarily normalized systems where there are data hotspots.
What Does This Mean For CIAM Systems, Dan?
CIAM systems normalize user data. They extract profile, credential, and (sometimes) authorization data about users from myriad applications into a central system. CIAM systems aren’t alone in this; directories and IAM systems do this as well, though CIAM systems focus on customers and users, as opposed to employees.
This normalization is the source of many benefits and challenges of CIAM.
When you have centralized customer data, you can:
easily query user data across your userbase, letting you see patterns (see the sketch after this list)
control access to systems in a holistic way
allow users to update their profile data in one place and have it spread throughout other systems
offer one place for signing in using one set of credentials
build more advanced authentication functionality in one place and have all dependent applications benefit from it
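To make the first of those concrete: assuming the CIAM system stores users in something like a Users table with Country and CreatedAt columns (hypothetical names, not any particular product’s schema), one query can surface patterns across every application’s users:
-- Hypothetical central user table: one query answers a question
-- that would otherwise span every application's datastore.
SELECT Country, COUNT(*) AS UserCount
FROM Users
WHERE CreatedAt >= '2024-01-01'
GROUP BY Country
ORDER BY UserCount DESC;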
These are some of the benefits of normalized user data. Of course, everything has tradeoffs.
Since applications that depend on a CIAM system still need some user data of their own, there’s some duplication. There may be application-specific user data in the application datastore, even if it is keyed off an id “owned” by the CIAM system. Sessions will typically be handled on an application-by-application basis as well. And an application may take profile or authorization information provided by the CIAM system and store it locally.
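Here’s a hypothetical sketch of what such an application-side table might look like (table and column names are invented for illustration):
-- Application-specific data keyed off the CIAM-owned user id.
CREATE TABLE AppUserPreferences (
    CIAMUserID VARCHAR(36) PRIMARY KEY, -- e.g. a UUID issued by the CIAM system
    Theme VARCHAR(20),
    CachedDisplayName VARCHAR(100) -- profile data copied locally, which can drift
);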
There’s also an issue with updates across systems. If an application depends on an OIDC id token for user data, for example, a user may update their profile information at the CIAM system, but the application may hold an outdated token. This profile data has been cached in the id token, and you need to bust that cache.
You can work around this in a few ways:
querying the CIAM system every time you need user data, which is similar to a join
setting the TTL of the cache (the lifetime of the id token) low enough that the risk of displaying stale data is minimized
designing your system so that when a change occurs in the CIAM system, dependent applications are notified
These are a few of the specific attributes of CIAM systems that are influenced by the way they normalize user data.
So, next time you’re musing about centralizing user data in a CIAM system, remember your data modeling and you’ll be a step ahead when thinking of the tradeoffs required.
Dan