Beefing Up Redshift Performance
MPP (massively parallel processing) is a natural fit for any data warehousing and big data use case, and Amazon Redshift outshines its peers in this space thanks to its ease of use, performance, and scalability. Optimization is one of the key steps in the data warehousing and big data world.
The following fundamental Redshift optimization techniques will help in tackling uneven query performance.
Prioritize Compression
Compression has a major impact on Redshift cluster performance. Its vital effects are:
1. Reduced storage utilization. Compressing columns significantly shrinks the data footprint, so the Redshift cluster uses less disk space.
2. Improved query performance. Data scans and joins become more efficient because they involve fewer I/O operations.
Choosing Compression Encodings
Choosing the right encoding for a column's type is vital for the performance of the table. It is advisable to skip encoding if a column has many non-repeated values or a low character count, as the decompression overhead would outweigh the optimization benefits.
If consecutive column values differ by amounts that fit in a single byte, such as unique consecutive integers, Delta encoding does better.
AZ64 is an AWS-proprietary encoding algorithm that is efficient for numeric, date, and time data types in both storage savings and query performance, as it uses SIMD for parallel processing.
Refer to the AWS documentation on compression encodings for a clearer picture.
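As a minimal sketch of how these choices look in practice, here is a hypothetical table with per-column encodings (the table and column names are made up for illustration):

```sql
-- Hypothetical events table with explicit per-column encodings.
CREATE TABLE events (
    event_id   BIGINT    ENCODE az64,  -- numeric: AZ64 is efficient here
    event_time TIMESTAMP ENCODE az64,  -- date/time: AZ64 is efficient here
    seq_num    INTEGER   ENCODE delta, -- consecutive values with small steps
    session_id CHAR(36)  ENCODE raw    -- mostly unique values: skip compression
);
```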
Testing Compression
To analyze the compression effectiveness of an existing table, use the ANALYZE COMPRESSION statement.
It is always recommended to use the COPY command with COMPUPDATE set to ON when loading data into Redshift, as it applies automatic compression.
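A quick sketch of both statements, reusing the hypothetical events table from above with placeholder S3 and IAM role values:

```sql
-- Report the recommended encoding per column for an existing table.
ANALYZE COMPRESSION events;

-- Let COPY pick encodings automatically; automatic compression is
-- applied when loading into an empty table.
COPY events
FROM 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE ON;
```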
Use Materialized Views
For query patterns that can be foreseen, or queries that are often repeated, it is best to make use of materialized views. Downstream applications can read the pre-computed data in the materialized view rather than running resource-intensive queries over large tables.
Since the data in the base table will almost certainly change over time, run REFRESH MATERIALIZED VIEW to bring the materialized view up to date.
A benefit of materialized views is that they can store the result set of a SELECT query over both Redshift tables and external tables. Referencing a materialized view is much more effective than querying external tables directly, as it uses the pre-computed results.
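As a sketch, assuming the hypothetical events table from earlier, a materialized view for a repeated daily aggregation might look like this:

```sql
-- Pre-compute a frequently requested aggregation once.
CREATE MATERIALIZED VIEW daily_event_counts AS
SELECT event_time::DATE AS event_date,
       COUNT(*)         AS event_count
FROM events
GROUP BY event_time::DATE;

-- Re-run whenever the base table has changed.
REFRESH MATERIALIZED VIEW daily_event_counts;
```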
AQUA
AQUA is a new AWS Redshift feature, in preview at the moment, that promises to run Redshift queries up to ten times faster than its peers. AQUA uses AWS specially designed data processors and also acts as a cache bump and a sub-query offload system, which reduces the workload of the Redshift cluster to make it more efficient. AQUA runs certain repetitive queries and preprocesses datasets before delivering them to Redshift.
Be Wary of DistKey Utilization
Be wary when choosing between EVEN and KEY based distribution styles: key-based distribution may lead to data skew and uneven query performance for queries that join on another key that is not co-located.
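To make the trade-off concrete, here is a hedged sketch with hypothetical tables: distributing on a frequent join key keeps that join co-located, while joins on any other column will still redistribute data.

```sql
-- Joins on customer_id are co-located across nodes; joins on any
-- other key will shuffle rows between nodes at query time.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE
)
DISTSTYLE KEY
DISTKEY (customer_id);
```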
Copy Compressed Data
The preferred way of copying data into Amazon Redshift is the COPY command, as it leverages Redshift's MPP architecture to transfer data in parallel. The COPY command can load files from various sources, such as S3, EMR, and DynamoDB.
Compressing files in S3 when loading large amounts of data accomplishes three goals:
- Faster file upload to S3
- Lower S3 storage utilization (cost)
- Faster load process, since decompression can happen as files are read
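A minimal sketch of such a load, with the bucket, prefix, and IAM role as placeholders:

```sql
-- Load gzip-compressed files; Redshift decompresses them as it reads.
COPY events
FROM 's3://my-bucket/events/2021/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
GZIP;
```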
Redshift Spectrum
As the volume of data grows, maintaining a data warehousing system becomes complex. Redshift Spectrum offers a comprehensive way to offload infrequently used data from the cluster and query it on demand directly from AWS S3. It helps users keep costs in check and keeps the load on the Redshift cluster minimal.
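A sketch of how cold data in S3 can be exposed and queried via Spectrum, with the Glue database, schema, table, and role names all hypothetical:

```sql
-- Map an S3-backed schema through the AWS Glue Data Catalog.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';

-- Query infrequently used data directly from S3 on demand.
SELECT COUNT(*)
FROM spectrum_schema.archived_events
WHERE event_date >= '2020-01-01';
```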
Significance of Statistics
Redshift has its own query execution engine, which uses statistics about tables to build a query execution plan. Based on those statistics, the engine picks the most optimized plan among the many it generates. Hence it is always recommended to ANALYZE tables frequently. However, the frequency of running ANALYZE should be decided only after a cost-based analysis, as it is itself a costly operation.
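A small sketch, again using the hypothetical events table; the PREDICATE COLUMNS clause limits the work to columns actually used in filters, joins, or group-bys:

```sql
-- Refresh planner statistics for the columns that matter most.
ANALYZE events PREDICATE COLUMNS;
```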
Create Custom Workload Manager (WLM) Queues
Easily the most vital contributor to Redshift query performance is efficiently managing the Amazon Redshift workload manager (WLM). Queries in Redshift run in a queuing model, and the default WLM configuration won't be adequate in most situations. Customizing WLM workloads has its advantages:
- Workloads can be scaled out by providing adequate resources (e.g., concurrency and memory)
- Queries and workloads of different types can be separated
Amazon Redshift allows users to define up to 8 queues, with a total of up to 50 slots, to separate the workloads. By setting up queue assignment rules, queries can be routed to specific queues.
For WLM configuration, it is recommended to follow this process:
- Separate users
- Define workloads
- Group users into workloads
- Select slot count & memory % per queue
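As a hedged sketch of queue routing, assuming a WLM queue whose assignment rule matches a query group named 'etl' (the group name is made up here):

```sql
-- Route this session's queries to the matching WLM queue.
SET query_group TO 'etl';

-- ... run the workload's statements here ...

RESET query_group;
```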
Use Short Query Acceleration (SQA)
It is common for a data analyst or scientist to fire a mixture of big and small queries. In these scenarios, it makes sense to execute the small queries immediately rather than have them wait behind intensive ones. Short Query Acceleration leverages machine learning to speed up the execution of these short queries.
Zone Map & Sort Key
Zone maps let Redshift execute queries faster by including or excluding data blocks quickly, without actually looking at the data, which also reduces I/O on the nodes. However, if the data is not sorted, zone maps have little effect: they work on the min-max ranges of each block, and if the data is left unsorted these ranges will overlap.
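A minimal sketch: declaring a sort key on the column most often filtered on keeps the min-max ranges of blocks narrow, so zone maps can skip whole blocks (table and column names are hypothetical):

```sql
-- Range filters on event_time can now skip entire blocks.
CREATE TABLE events_sorted (
    event_id   BIGINT,
    event_time TIMESTAMP
)
SORTKEY (event_time);
```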
Efficient queries
Another thing you’ll want to check is if your queries are efficient. For example, if you’re scanning an entire dataset with a query, you’re probably not making the best use of your compute resources.
A few tips for writing performant queries:
- Consider using INNER joins, as they are more efficient than LEFT joins.
- Stay away from UNION whenever possible.
- Specify multiple levels of conditionals when you can.
- Use EXPLAIN to show the query execution plan and cost (see the sketch after this list).
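A short sketch of EXPLAIN on a hypothetical join; costly steps and data-redistribution operators in the plan are the first things to look for:

```sql
-- Show the execution plan and relative cost without running the query.
EXPLAIN
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2021-01-01';
```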
Miscellaneous optimizations
- Scale up the cluster. During peak times, if query performance is found to be on the slower side, add more nodes to the cluster.
- Schedule vacuums. If there are frequent UPDATE and DELETE operations on the Redshift cluster, schedule a regular vacuum to reclaim the disk space occupied by rows marked for deletion (see the sketch after this list).
- Use Redshift Advisor. It is always worth taking help from the Redshift Advisor, as its recommendations are derived from statistics and observations over operational data and will help improve the efficiency of the cluster.
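A minimal sketch of a scheduled vacuum, using the hypothetical events table:

```sql
-- Reclaim space from rows marked for deletion and re-sort the table;
-- best run during low-traffic windows.
VACUUM FULL events;
```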
Conclusion
Even though this piece makes you aware of most of the fundamental tuning operations for Redshift, as with any other AWS offering it is always good to stay up to date on the recent changes and new features of AWS Redshift, by closely following the AWS blogs and re:Invent, to get the maximum out of Redshift.