June 16, 2023
Best practices for schema design, query debugging and database support in Greenplum
Best practices are methods and approaches that experience has shown to be superior to the alternatives. They are an amalgamation of knowledge and experience that allows a product to be used correctly and optimally.
Building a data warehouse, like any software system, is not only about usability. Performance, extensibility, and maintainability are also extremely important. Each of these topics is complex and voluminous, and for each there are best practices to follow.
Of course, following best practices alone is unlikely to get the most out of the system, because every project is unique. Still, these fairly simple practices will improve performance and simplify the expansion and maintenance of your Greenplum cluster.
In this tutorial we will try not only to describe the rules to follow, but also to explain why they should be followed. We will discuss the specifics of schema design and ways to optimize queries under Greenplum's MPP architecture, and also touch on some aspects of database maintenance.
Stick to the same data types in columns used for joins. If the data types differ, Greenplum has to perform additional type conversions.
Use the smallest data types that fit. Using TEXT or VARCHAR instead of CHAR does not affect performance, but it saves disk space. The Greenplum documentation recommends keeping disks below 70% full, because the database needs free space to operate.
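As a minimal sketch of both rules (table and column names are hypothetical), the join key uses the same type in both tables, and each column uses the smallest type that fits:

```sql
-- Hypothetical dimension and fact tables.
CREATE TABLE customers (
    customer_id int,            -- int is enough here; bigint would waste space
    name        text            -- text instead of char(n)
) DISTRIBUTED BY (customer_id);

CREATE TABLE orders (
    order_id    bigint,
    customer_id int,            -- same type as customers.customer_id,
                                -- so no conversion is needed on join
    amount      numeric(12,2)
) DISTRIBUTED BY (customer_id);
```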
Greenplum provides several storage models.
Heap storage is the default storage model and supports only row orientation.
Append-optimized storage allows both row and column orientation; the default is row orientation.
Note that Greenplum supports hybrid storage: with partitioning, a different storage format can be specified for each partition.
Here are recommendations for selecting and using storage models:
Append-optimized storage is designed for large fact tables that are updated infrequently and used in analytical queries; it also saves about 20 bytes of per-row overhead.
However, delete and update operations, especially frequent ones affecting few rows, are inefficient on this storage type. Append-optimized data files really are append-only: it is physically impossible to overwrite stored rows. Nevertheless, implementation details beyond the scope of this article make DELETE and UPDATE possible from the user's point of view.
Heap and Append-Optimized Storage
- Use heap storage if iterative batch or single-row data modification operations will be performed on the table.
- Use heap storage if concurrent UPDATE and DELETE operations will be performed on the table.
- When creating heap tables, order the columns as follows: distribution and partition keys first, then fixed-size types from largest to smallest, then variable-size types. This achieves data alignment and improves performance.
- Use append-optimized storage if the data in the table is updated infrequently and in large batches.
- Avoid single-row data modification operations on append-optimized tables.
- Do not run concurrent UPDATE or DELETE operations on append-optimized tables; concurrent batch INSERT operations are fine.
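The bullets above can be sketched as DDL. Table names are illustrative; the appendoptimized=true option is the Greenplum 6 spelling (older releases use appendonly=true):

```sql
-- Frequently updated reference data: heap storage (the default).
CREATE TABLE customer_status (
    customer_id int,
    status      text
) DISTRIBUTED BY (customer_id);

-- Large fact table loaded in batches and rarely updated: append-optimized.
CREATE TABLE sales_fact (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
) WITH (appendoptimized=true)
  DISTRIBUTED BY (sale_id);
```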
With column-oriented storage, each column is stored in a separate file, which compresses uniform data effectively and speeds up queries that read a small number of columns.
Row-oriented and Column-oriented
- Use row-oriented storage when queries touch many columns of the same row.
- Use row-oriented storage for mixed workloads.
- Use column-oriented storage for queries that read few columns.
- Use column-oriented storage if one column in the table is frequently updated while the rest of the data is not affected.
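For example (hypothetical table; appendonly=true on older releases), an append-optimized, column-oriented table for analytic queries that read only one or two of its columns:

```sql
CREATE TABLE clicks (
    event_time timestamp,
    user_id    bigint,
    url        text
) WITH (appendoptimized=true, orientation=column)
  DISTRIBUTED BY (user_id);
```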
Greenplum provides many compression options for append-optimized tables or sections. Compression improves I/O by increasing the amount of data received with each disk read operation.
There are the following nuances and good practices when choosing a compression method:
- Set compression at the partition level. Note that newly added partitions do not inherit the table's compression settings; they must be specified separately.
- The best compression ratio is generally provided by the RLE algorithm. A higher compression level usually yields more compact disk storage but costs additional processor cycles on write and read.
- The RLE algorithm works best with files that contain repetitive data.
- By sorting the data and combining different compression options, the best level of performance can be achieved.
- Never use database compression if the data is stored in a compressed file system.
- In general, the optimal compression algorithm is determined by trial and error. Test different types of compression and data ordering to determine the best method for your data.
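As a sketch of combining compression options (hypothetical table; zstd requires a reasonably recent Greenplum release), column-level ENCODING clauses let each column use the algorithm that suits its data:

```sql
CREATE TABLE events (
    user_id    bigint,
    event_date date ENCODING (compresstype=rle_type),      -- repetitive values
    payload    text ENCODING (compresstype=zstd, compresslevel=3)
) WITH (appendoptimized=true, orientation=column)
  DISTRIBUTED BY (user_id);
```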
One of the key aspects of performance is choosing the right distribution. Because a system with a parallel architecture completes its work only when all of its parts have finished, it runs at the speed of its slowest element. It is therefore necessary to spread the load evenly across the segments.
If possible, distribute large tables that are joined together on the same key, which enables local joins: the data are joined on each Greenplum segment independently, which can significantly improve performance because no data movement is needed. This is another reason to use identical data types for matching fields in different tables: the same value may hash differently under different types, in which case the rows will not land on the same segments.
Uneven distribution causes data skew, which is a frequent cause of poor performance. It is extremely important to check the distribution after changing the data in the database. The following query displays the maximum and minimum number of rows per segment for a facts table and the percentage difference between these values:
SELECT max(c) AS "Max Seg Rows", min(c) AS "Min Seg Rows",
       (max(c) - min(c)) * 100.0 / max(c) AS "Percentage Difference Between Max & Min"
FROM (SELECT count(*) AS c, gp_segment_id FROM facts GROUP BY 2) AS a;
In addition to data skew, processing skew can occur. It arises during query execution, which makes it harder to track down. If some segments sit idle during a query, processing skew is the likely cause. First of all, look at spill files: they are created when Greenplum does not have enough RAM to execute a query, and since disk is far slower than RAM, they seriously hurt performance.
Spill files can usually be eliminated by rewriting the query, for example by using temporary tables.
There are two views in the gp_toolkit schema to help track skew:
- In the gp_toolkit.gp_skew_coefficients view, the skccoeff column shows the coefficient of variation for the table. The lower the value, the better.
- In the gp_toolkit.gp_skew_idle_fractions view, the siffraction column displays the percentage of the system that is idle during the table scan. For example, a value of 0.1 indicates a skew of 10%. A skew of more than 10% is a reason to change the table distribution.
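Both views can be queried directly, for example:

```sql
-- Coefficient of variation per table: lower is better.
SELECT skcrelname, skccoeff
FROM gp_toolkit.gp_skew_coefficients
ORDER BY skccoeff DESC;

-- Fraction of the system idle during a scan of each table;
-- values above 0.10 (10%) suggest changing the distribution.
SELECT sifrelname, siffraction
FROM gp_toolkit.gp_skew_idle_fractions
ORDER BY siffraction DESC;
```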
There are the following recommendations regarding distribution:
- Set the distribution policy explicitly. The default distribution uses the first column of the table, which is rarely optimal.
- Prefer a single column as the distribution key.
- The distribution column should have unique values or very high cardinality. If a column has many NULL values, do not use it as a distribution key, because all rows with NULL in the key will land on one segment.
- If a uniform distribution cannot be achieved with one column, try a two-column key. However, this rarely produces a more even distribution and adds hashing time.
- If a two-column key still does not distribute the data evenly, use random distribution. Keys of more than two columns are rarely more productive, as they require extra data movement between segments.
- Grouping by the distribution key is faster because no data movement is needed.
- Do not use columns that queries filter on as distribution keys.
- Do not choose dates as the distribution key either. Data is often selected over a period of time, so leave dates for partitioning.
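The recommendations above might look like this in DDL (hypothetical tables):

```sql
-- Explicit single-column key with high cardinality:
CREATE TABLE page_views (
    view_id   bigint,
    user_id   bigint,
    view_date date          -- a date: good for partitioning, not distribution
) DISTRIBUTED BY (view_id);

-- No column gives an even spread: fall back to random distribution.
CREATE TABLE misc_log (
    logged_at timestamp,
    message   text
) DISTRIBUTED RANDOMLY;
```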
Partitioning is an important aspect of Greenplum: besides speeding up scans, it allows you to create tables with a hybrid storage format.
The storage format can be specified for each partition separately, and partitions can even store their data in external sources.
The following should be taken into account when deciding on partitioning:
- Partition only large tables. For tables of a few thousand rows, the cost of maintaining partitions outweighs any performance gain.
- Use partitioning only if your queries can achieve partition elimination; otherwise partitioning will only do harm. You can verify that partition elimination occurs by using EXPLAIN.
- Never partition a table on its distribution key.
- If possible, prefer range partitioning to list partitioning.
- Be careful when partitioning column-oriented tables. The number of physical files for a column-oriented table equals the number of segments multiplied by the number of columns multiplied by the number of partitions. A common rule of thumb is to keep the number of files per segment under one hundred thousand.
- Try to keep the data evenly distributed across partitions. If the data is spread evenly over 10 partitions, a scan that hits one partition will be 10 times faster than a scan of the same table without partitioning.
- The planner can selectively scan partitions only when the query contains a simple comparison operator: =, <, <=, >, >=, <>.
- Selective scanning works with functions of volatility category STABLE or IMMUTABLE, but not VOLATILE. For example, with date > CURRENT_DATE the planner will selectively scan partitions, but with time > timeofday() it will not.
- Avoid multi-level partitioning. The added administrative complexity usually outweighs the performance gain. For performance, scalability, and maintainability, find a balance between scan speed and the number of partitions.
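A sketch of range partitioning that satisfies these rules (hypothetical table): the partition column (sale_date) differs from the distribution key, and a predicate such as sale_date >= date '2023-06-01' lets the planner eliminate partitions:

```sql
CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
) DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2023-01-01') INCLUSIVE
    END   (date '2024-01-01') EXCLUSIVE
    EVERY (interval '1 month')
);
```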
Due to its distributed architecture, Greenplum performs sequential scans quickly, so indexes are not as critical as in traditional databases.
Indexes can both positively affect performance and degrade it. Consider the following when deciding to create indexes:
- Indexes can improve the performance of highly selective queries. For example, an index on a single column of a column-oriented table can help if that column is used in drill-through queries.
- Do not create indexes on frequently updated columns.
- Drop indexes before loading data, and recreate them afterwards.
- For columns with high selectivity, choose a B-tree index.
- Use bitmap indexes on columns with low cardinality: roughly 100 to 100,000 unique values.
- Do not create bitmap indexes on frequently updated columns.
- In general, avoid indexes on partitioned tables. If one is needed, the index column must be different from the partitioning column.
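For example (hypothetical table and columns):

```sql
-- High-selectivity column: B-tree (the default index type).
CREATE INDEX orders_customer_idx ON orders (customer_id);

-- Low-cardinality, rarely updated column: bitmap.
CREATE INDEX orders_status_idx ON orders USING bitmap (status);
```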
Optimizer and query plan
Preamble: this and the following sections refer to configuration parameters that control various aspects of the database. You can read about these parameters in the Greenplum documentation.
Greenplum uses a cost-based optimizer: when building a query plan, it computes the cost of each operation and selects the cheapest method. A cost of 1.0 corresponds to one sequential page read from disk.
Greenplum supports two optimizers: GPORCA and the Postgres Planner. In general, GPORCA optimizes queries better for multicore architectures and is the default optimizer. However, GPORCA is tuned for analytic queries and tends to take longer to build a plan than the Postgres Planner, so the latter is better suited to lightweight queries. It is worth checking queries with both optimizers.
When choosing GPORCA, there are some nuances to consider; see the Greenplum documentation for details.
The optimizer configuration parameter selects the optimizer: when it is ON, GPORCA is used.
The EXPLAIN and EXPLAIN ANALYZE constructs display the query plan, letting you evaluate how much each operation affects performance and understand how to improve speed.
EXPLAIN displays the estimated plan without executing the query. EXPLAIN ANALYZE executes the query and displays the actual plan with additional information, such as:
- Total elapsed time in milliseconds
- The number of segments involved in the execution
- The number of rows returned by the segment that produced the most rows, together with that segment's ID
- The memory used by the operation
- The time to retrieve the first row and the total time to retrieve all rows, for the segment that produced the most rows
To view the plan with EXPLAIN ANALYZE for queries that modify data (INSERT, UPDATE, DELETE), wrap them in a transaction (BEGIN; EXPLAIN ANALYZE …; ROLLBACK;) to avoid affecting the data.
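A minimal sketch of this pattern (hypothetical table):

```sql
BEGIN;
EXPLAIN ANALYZE
    UPDATE orders SET amount = amount * 1.1 WHERE status = 'pending';
ROLLBACK;   -- the UPDATE is undone; only the plan output remains
```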
The plan is a tree of nodes, each representing an operation. The plan is read from the bottom up: each node passes its result to the node directly above it. The cost and time reported for a node include all of its child nodes, so the values of the topmost node are, as a rule, the values for the entire query.
We won't go into the details of reading a query plan in this article; we will only highlight the points worth paying attention to.
- As mentioned in the section on partitioning, check for partition elimination. To achieve it, the query predicate must contain a condition on the partitioning column, and the predicate must not contain subqueries or VOLATILE functions.
- Evaluate the order of operations. It is desirable that the largest tables occur as high up in the execution plan as possible to minimize the number of rows passed from node to node. If analysis of the plan reveals that the order of execution is not optimal, make sure the table statistics are up to date. Running ANALYZE will probably help in generating a better plan.
- Spill also shows up in the query plan: the slice that generated spill files is marked with an asterisk (*), and the size of the spill files is shown. Heavy queries cannot always avoid spill files, but their presence may be a sign of a suboptimal query.
- Pay attention to nodes with Sort or Aggregate operations. If these operations run over a large number of rows, the query can often be improved: HashAggregate is better suited to large row counts, and most Sort operations can be replaced by HashAggregate if the query is rewritten. For the planner to favor HashAggregate, the enable_groupagg configuration parameter must be set to ON.
- If you notice a broadcast motion over a large number of rows in the plan, try to eliminate it. One way is to change the gp_segments_for_planner parameter, which specifies how many segments the plans should be built for; as this value increases, so does the estimated cost of motion operations, and the planner will more often choose a redistribute motion instead of a broadcast motion.
As mentioned in the previous section, Greenplum uses a cost-based optimizer, and for the optimizer to estimate the cost of operations adequately and build optimal query plans, it needs up-to-date table statistics. Up-to-date statistics are, in general, an essential prerequisite for good performance. The ANALYZE construct and the analyzedb utility exist to collect statistics.
It is important to find a compromise between the accuracy of the statistics and the time it takes to analyze.
ANALYZE does not scan the whole table; it samples rows randomly. The sample size is controlled by the default_statistics_target parameter, which takes values from 1 to 1000 and defaults to 100. Increasing it can improve the quality of plans, especially for columns with irregular data patterns.
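The target can also be raised for a single problematic column instead of globally (hypothetical table and column names):

```sql
-- Sample this column more thoroughly than the default 100:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders (status);   -- re-collect statistics for just that column
```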
Here are recommendations for collecting statistics:
- If you need to update statistics for multiple tables, use analyzedb, which updates statistics incrementally and in parallel. For append-optimized tables, statistics are updated only if the data has changed; for heap tables, statistics are always updated.
- Always run ANALYZE after INSERT, UPDATE, or DELETE commands that significantly change the data in a table.
- Always collect statistics after creating indexes.
- If collecting statistics on a large table takes too long, run ANALYZE only on the columns used for joins, filtering, sorting, or grouping. And when data changes in a single partition, it is sufficient to run ANALYZE on that partition.
- After adding a partition, run analyzedb on the root partition.
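Typical analyzedb invocations might look as follows (requires a running Greenplum cluster; database, schema, and table names are illustrative):

```shell
# Refresh statistics for one table, incrementally:
analyzedb -d warehouse -t public.sales

# Refresh a whole schema with 5 parallel sessions:
analyzedb -d warehouse -s public -p 5
```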
Database bloat is disk space that has been freed but not reclaimed for reuse. The VACUUM construct exists to eliminate bloat.
For heap tables, a changed or deleted row is marked as expired rather than physically removed, and VACUUM records expired rows as free space available for reuse. For frequently updated tables, a small number of expired rows is normal, but letting a table bloat significantly slows down I/O and query speed.
It is critical to run VACUUM periodically, especially after UPDATE and DELETE operations over many rows, to avoid ever needing VACUUM FULL.
If the number of expired rows grows so large that the free space map can no longer track them, VACUUM FULL is required. VACUUM FULL takes an ACCESS EXCLUSIVE lock, creates a new table, copies the data into it, and drops the old table. This is a costly operation that can take a long time for large tables.
As stated earlier, it is physically impossible to overwrite rows in an append-optimized table. VACUUM therefore rewrites the table into new files without the obsolete rows, an operation similar to VACUUM FULL on heap tables.
However, the rewrite occurs only when the percentage of obsolete rows exceeds the threshold set by the gp_appendonly_compaction_threshold parameter (10% by default). As a result, VACUUM may compact a table on only some of its segments.
This laborious reuse of space is one reason why UPDATE and DELETE are not recommended on append-optimized tables.
Let’s summarize the disk space recovery:
- Use VACUUM after UPDATE or DELETE on many rows.
- Instead of VACUUM FULL, use CREATE TABLE … AS to copy the table, then drop the original.
- Regularly run VACUUM on the system tables, as they are used in most operations and their bloat can degrade the performance of the entire database. If VACUUM FULL becomes necessary, first stop all activity in the database, and when it finishes, collect statistics on the system tables.
- Never send a kill command to the VACUUM process over system tables.
- Use REINDEX to restore the space used by the indexes.
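The summary above as commands (hypothetical user table name):

```sql
VACUUM sales;                       -- reclaim space after bulk UPDATE/DELETE
VACUUM pg_catalog.pg_attribute;     -- system catalogs bloat too
REINDEX TABLE sales;                -- rebuild indexes to reclaim their space
```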
So, we’ve discussed some of the best practices touching on key aspects of building and maintaining a data warehouse in Greenplum.
We hope that we have not only provided information about best practices, but have also provided insight into the principles and mechanisms on which they are based.