There are many decisions to make when creating new tables and data warehouses. Some things that seemed insignificant at the time, made you and the user feel painful during the lifetime of the database.
We have worked with thousands of people and their databases and have experienced long-term read and write queries, and we have seen almost every situation. Here are 10 rules for creating a pain-free model.
1.USE ONLY LOWERCASE LETTERS, NUMBERS AND UNDERSCORES
Do not use dots, spaces, or hyphens in database, schema, table, or column names [Note 1]. Points are used to mark objects, usually in the form of database.schema.table.column.
Containing dots in object names will cause confusion. Similarly, using spaces in object names will force you to add unnecessary quotation marks in the query:
select "user name" from events
select user_name from events
If there are capital letters in the table or column names, the query statement will be difficult to write. If all letters are lowercase, people won't have to remember whether the users table is Users or users.
When you finally modify the database or copy your tables to the warehouse, you don't need to remember which table is case sensitive.
2.USE SIMPLE, SELF-EXPLANATORY COLUMN NAMES
If the users table needs a foreign key from the packages table, name the key package_id. Avoid short, obscure names, such as pkg_fk; others don’t know what it stands for. The self-explanatory name makes it easier for others to understand the pattern, which is essential for maintenance efficiency as the team size increases.
Don't use ambiguous names for polymorphic data . If you find yourself creating columns like item_type or item_value, then you'd better use more columns with specific names, such as photo_count, view_cout, transaction_price.
In this way, the content of the column can often be learned from the pattern, without relying on other values in the current row.
select sum(item_value) as photo_countfrom itemswhere item_type ='Photo Count'
select sum(photo_count) from items
Do not use the name of the containing table as the prefix of the column name. In general, letting the user table contain columns of the form user_birthday, user_created_at, user_name does not help much.
Avoid using reserved words such as column, tag, and user as column names. You will have to use extra quotation marks in the query.Failure to do so will make you confused about the error message. If the reserved words appear where the column names should appear, the database will greatly misunderstand the query.
3.USE SIMPLE, SELF-EXPLANATORY TABLE NAMES
If the table name consists of multiple words, use underscores to separate these words. package_deliveries is easier to read than packagedeliveries.
If possible, use one word instead of two: deliveries are easier to read.
select * from packagedeliveries
select * from deliveries
Don't prefix the table to imply the mode. If you need to group tables into ranges, put these tables into a schema. Table names such as store_items, store_transactions, and store_coupons are the same as prefixed column names and are usually not worth the extra typing.
We recommend using plurals for the table name (for example, packages), and plurals for both words in the name of the join table (for example, packages_users). Singular table names are more likely to occasionally collide with reserved words, and usually have lower readability in query statements.
4.THE PRIMARY KEY IS AN INTEGER
Even if you are using UUID, it does not make sense (for example, for joining tables), add a standard id column and a sequence of self-incrementing integers. This key makes certain queries easier, such as selecting only the first row of a set of data .
If the imported task needs to copy data, this key will be a lifesaver, because you can delete specific rows:
delete from my_tablewhere id in (select...) as duplicated_ids
Avoid multiple-column primary keys. They are difficult to infer when writing effective queries as much as possible, and difficult to modify. Use integer primary keys, multi-column unique constraints, and some single-column indexes instead.
5.BE CONSISTENT WITH FOREIGN KEYS
There are a lot of naming styles about primary keys and foreign keys. We recommend and the most popular is that any table foo must have a primary key named id, and all primary keys are named foo_id.
Another popular style uses globally unique key names.The table foo has a primary key named foo_id, and all foreign keys are also named foo_id. If you use abbreviations (the primary key of the users table is uid), it will cause confusion or naming conflicts, so do not abbreviate.
No matter what style you choose, keep it. Do not use uid in some places and user_id or users_fk in other places.
select *from packages
join users on users.user_id = packages.uid
select *from packages
join users on users.id = packages.user_id
select *from packages
join users using (user_id)
Also pay attention to foreign keys that don't obviously match the table. The name of the column named owner_id may be a foreign key to the users table, or it may not. Take the column name as user_id and, if necessary, as owner_user_id.
6.STORE THE TIME AS DATETIME
Don't keep dates as Unix timestamps or strings: instead convert them to datetime. Although SQL's date math function is not the best, it is even harder to handle the timestamp yourself. Using the SQL date function requires that the timestamp be converted to datetime for each query:
select date(from_unixtime(created_at))from packages
select date(created_at)from packages
Do not store the year, month, and day in separate columns. This makes each time series [Note 2] query very difficult to write, and will prevent most new SQL users from using the date information in the table.
select date(created_year ||'-'
|| created_month ||'-'
7.UTC, ALWAYS UTC
Using a certain time zone instead of UTC will cause endless problems. Excellent tools (including Periscope ) have all the functions you need to convert data from UTC to the current time zone. In Periscope, add :pst to easily convert UTC to Pacific Time:
select [created_at:pst], email_addressfrom users
The time zone of the database should be UTC, and all datetime columns should be of the type without the time zone (time stamp without time zone).
If the time zone of your database is not UTC, or your database has both UTC and non-UTC datetime, then the difficulty of analyzing time series will be greatly increased.
8.A SINGLE SOURCE OF TRUTHFUL DATA
For a piece of data, there should be one and only one true source [Note 3]. Views and summaries should be labeled. In this way, data users will understand that there is a difference between the data they use and the real data.
select *from daily_usage_rollup
Leaving columns such as user_id, user_id_old, user_id_v2, etc.discarded will become a never-ending source of confusion. In daily maintenance, be sure to drop discarded tables and discarded columns.
9.PREFER TABLES WITHOUT JSON COLUMNS
You certainly don't want a very wide table. If there are many columns, and some of them are named sequentially (such as answer1, answer2, answer3), you will suffer in the future.
Split this table into a pattern with no duplicate columns, the form of this pattern will be particularly easy to query. For example, to get the number of completed answers in the survey table:
(case when answer1 is not null
then 1 else 0 end) +
(case when answer2 is not null
then 1 else 0 end) +
(case when answer3 is not null
then 1 else 0 end)
) as num_answersfrom surveyswhere id = 123
select count(response)from answerswhere survey_id = 123
For analytical queries, extracting data from JSON columns can greatly reduce query efficiency. Although there are many reasons to use JSON columns in a production environment, they are not for analysis. Strongly convert JSON columns into simpler data types, making analysis easier and faster.
Date, zip code, and country do not need to have their own tables with primary key queries. If you bring it, each query will contain a small number of identical connections. This creates a lot of repetitive SQL for the database, and a lot of extra work.
join dates on users.created_date_id = dates.idgroup by 1
count(1)from usersgroup by 1
Tables are the first type of objects with large amounts of their own data. All other data should be additional columns on more important objects.
LOOKING FORWARD TO A BETTER MODEL!
Armed with these rules, your next table or warehouse will be easier to query for you and new team members as the team grows. If you disagree or have more suggestions on rules, please email us at [email protected] . We would love to hear your voice!