We discuss a DB design inefficiency today. Picked up from a real-world use case, I hope this helps clarify both – how to and how not to design enumeration tables.
Intro
Enums are objects of a class. Statically defined, they help reduce the time to create and use objects ( among others ).
Enums from the programming world are sometimes mapped into the DB also. As an example … we might have tables called Cities, Countries, or Types. These tables will have specific values … e.g.
Cities enum:
Name |
Code |
---|---|
New Delhi |
NDLS |
New York |
NY |
Similarly, other enumerations have their specific names and codes.
Lights, Camera …
Enums can be kept in the DB in multiple ways.
One can be – keep all city names in a Table. Other tables or programs can read from this table and keep the name.
But this method has a drawback. The name of the city needs to be duplicated in every single row. Meaning, the string “New York” is stored across many tables/ rows. Storage across multiple rows consumes huge redundant space on the DB server.
From Stackoverflow :
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes overhead instead of 1.
This means, for an average of 10,000 rows we end up giving away 1 MB of space. (~100 bytes * 10000 ). And we haven’t considered …
PS : Memory is the only consideration till now. But it will not remain so. Let’s see.
Action
One way of mitigating this is using a representation of the City. This can be done in various ways.
GUID/ UUIDs
Name |
Code |
---|---|
New Delhi |
4eecc9f5-7535-4ff2-ae93-ee6d0a9a281c |
New York |
e728695b-b9d7-4724-8942-6dc472fcecb6 |
The string “New York” takes ~100 bytes, and a UUID takes 16 bytes. Therefore, GUID, UUID identifiers representing Strings reduce memory usage. Great !
But wait a sec … did we introduce new problems ? 😕 😕
Yves Trudeau explains : UUID comparisons are about 28 times slower. Even if the difference appears rapidly in the char values, it is still about 2.5 times slower.
We did save space but introduced a performance issue. 😹 😹
For developers, ‘NY’ is easier to understand compared to understanding ‘b9def35e-2d79-4c2f-b0a8-aae6e8b000ef’
GUIDs or UUIDs are random numbers and therefore cannot be interpreted easily. Even long working developers need an additional query to understand their mapping.
But, this is an invisible impact. How do we measure it ?
Assuming this additional translation takes 1 minute each time. ( I am being conservative )
An approximation can be: 15 mins of additional effort understanding data, per developer. For a 100 member team. This becomes – 100 * 15 = 25 hours !
Meaning, 1 day of productivity is lost among a 100 member team, each day !
Insignificant ? I think not !
PS : Obscurity by UUIDs might be a good thing, in case the use case needs it.
Next, we move on to codes.
Codes
What happens when we refer to our enumerations using codes ?
Name |
Code |
---|---|
New Delhi |
ND |
New York |
NY |
While the string “New York” took ~100 bytes, “NY” takes lesser ( lets say ~20 bytes). So, a smaller memory footprint is the good news.
The second benefit is processing time related. Comparing 2 alphabet codes is faster than comparing 36 alphabet ones (UUID). ( see Trudeau’s article for details )
Yet another benefit is that the code is human-understandable. We discussed the cost of human readability at 25 hours for a 100 member team. In this case, we might have reduced the cost by 80%. Saving 22 hours, each day for an organization is a significant achievement.
Satisfied ? or do we still have Need For Speed ? ( yes the ref. Is to the popular video game ! )
Let’s check out the final option …
Integers
What if we represent each City/ Country with an Integer ?
Name |
Code |
---|---|
New Delhi |
1 |
New York |
2 |
Since an Integer takes up 4 bytes of space, we have further improved memory usage. At the same time, we know that integers process faster. Cherry on the top !
But, we’ve done so at a cost … can you guess what ?
We gave way the ‘human readability’ factor. Adding ~25 hours for a 100 member team. Not an insignificant number.
Moral of the story
My personal favourite among the choices is the use of codes. While it does give away memory and CPU clocks. It provides human readability.
If not codes, I would go with Integers. Using idenfifiers like GUID/ UUID would be my last preference and would be based on the use case.
But, do NOT take my word. Please think for yourself.
Each organization needs to weigh between memory and human legiblity costs. As examples :
Modern day databases are designed to handle much bigger loads. In this case, the redundancy is less. Therefore, human readability can be given greater weightage. Meaning, we could go with NY type codes.
~20 bytes taken up by the code NY copied millions of times is a bigger inefficiency. Using Integers will be a better approach in this case.
So – To each thy own
With this we wrap up a DB design inefficiency ( for enumerations ). More coming up !