Wednesday, March 27, 2013

UUIDs in MySQL

Auto incremented numbers can be so nice for unique IDs in a database; they're guaranteed to be unique, can range to fit your needs (tinyint to unsigned bigint), and easy to use. When working in a sharded environment, they may not fit the bill anymore; you either lose global uniqueness or you lose ease of use (depending on your sharding mechanism). One very good alternative is to use UUIDs.

UUIDs solve the global uniqueness and ease of use problem at the same time, but introduces another problem in the database layer - memory footprint. A UUID is a 36 byte ASCII string of characters representing a 16 byte value. I've seen them stored as CHAR(36), VARCHAR(36), etc. The problem with storing them in their ASCII form is that the index will grow beyond the capacity of the database server faster than when they're stored in a binary form. Memory will fill faster, queries will hit the disk more frequently, and it will ultimately cost more money to save the exact same information. It's not worth it.

Data-Driven Analysis

What follows is the result of a test I performed on my laptop to see just how much worse storing a UUID as VARCHAR(36) is, as compared with the less bulky BINARY(16). Just guessing, it will be 36/16 worse, right? I wanted to be sure, since MySQL does many things under the hood.

Database Tables

We'll create 2 separate tables, one for binary representation and the other for ascii. These two tables will have a key on the UUID field, but that key will not be primary. Typically, you'll use the UUID as the primary key, but I found key sizes easier to report if they were not primary.

CREATE TABLE `uuidbinary` (`id` BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, `uuid` BINARY(16) NOT NULL, `value` BIGINT NOT NULL, KEY (`uuid`) USING HASH) ENGINE=INNODB;

CREATE TABLE `uuidvarchar` (`id` BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY, `uuid` VARCHAR(36) NOT NULL, `value` BIGINT NOT NULL, KEY (`uuid`) USING HASH) ENGINE=INNODB;

Script

Rather than bore you with a stupid-simple PHP script, let me give you pseudo code.

connect to db
for 0 to 1000
  query "insert into uuidbinary values (null, unhex(<uuid without dashes>), <random int>)"
for 0 to 1000
  query "insert into uuidvarchar values (null, <uuid with dashes>, <random int>"


I split the inserts so that I could monitor their execution time independently. The only queries that were run were INSERT INTO queries.

Results

As expected, the binary table was much more efficient at inserting and storing rows. I will assume that you are smart enough to read 6 graphs. I will summarize at the end.

Total Index Size - Linear Axis

Total Index Size - Logarithmic Axis
Storage Size - Linear Axis

Storage Size - Logarithmic Axis

Index Size / Total Size

Insertion Time / row
We can summarize these graphs by saying "yes, it's worth it to store your UUIDs as BINARY(16)". It will cost less money in disk space, memory, CPU, bandwidth, and extra database servers.

Extrapolation

Using either the data size, the index size, or the sum of both, the varchar format ends up using about 1.5x as much space as the binary format. You're probably doing something wrong if you're using a relational database like MySQL, and only storing IDs once, though. For something like a user ID, you are likely storing the ID at least 5 times. The 1.5x becomes 7.5x. If your database server is running with 7.5 GB of RAM dedicated to user IDs, you can reduce it to 1.5 GB by storing those UUIDs in the BINARY(16) format.

Storing the user ID 5 times seems excessive, but at Lucidchart, we store it in the following ways:

  • Once per user
    • User Record
    • User Attributes
  • Zero to many per user
    • Payments
    • Documents
    • Team
    • Folders
    • Invitations
    • Valid Sessions
    • Newsletters
    • Images
    • Image Tags
    • Spellchecker Words
    • Statistics (millions per user)
    • Aggregated statistics (thousands per user)
As you can imagine, the more times you use the same ID, the more you can save by using my recommended method.

Update: I've made a new post about how to use UUIDs with Play and Anorm.

1 comment:

Note: Only a member of this blog may post a comment.