Amazon Redshift is a completely managed, petabyte-scale information warehouse service within the cloud. You can begin with just some hundred gigabytes of knowledge and scale to a petabyte or extra. Amazon Redshift lets you use your information to accumulate new insights for what you are promoting and clients whereas preserving prices low.
Along with price-performance, clients wish to handle information transformations (SQL Choose statements written by information engineers, information analysts, and information scientists) in Amazon Redshift with options together with modular programming and information lineage documentation.
dbt (information construct instrument) is a framework that helps these options and extra to handle information transformations in Amazon Redshift. There are two interfaces for dbt:
- dbt CLI – Out there as an open-source challenge
- dbt Cloud – A hosted service with added options together with an IDE, job scheduling, and extra
On this submit, we show some options in dbt that enable you handle information transformations in Amazon Redshift. We additionally present the dbt CLI and Amazon Redshift workshop to get began utilizing these options.
Handle widespread logic
dbt lets you write SQL in a modular trend. This improves maintainability and productiveness as a result of widespread logic might be consolidated (preserve a single occasion of logic) and referenced (construct on present logic as a substitute of ranging from scratch).
The next determine is an instance displaying how dbt consolidates widespread logic. On this instance, two fashions depend on the identical subquery. As an alternative of replicating the subquery, dbt lets you create a mannequin for the subquery and reference it later.
The idea of referencing isn’t restricted to logic associated to subqueries. You can even use referencing for logic associated to fields.
The next is an instance displaying how dbt consolidates widespread logic associated to fields. On this instance, a mannequin applies the identical case assertion on two fields. As an alternative of replicating the case assertion for every subject, dbt lets you create a macro containing the case assertion and reference it later.
How is a mannequin in dbt subsequently created in Amazon Redshift? dbt gives you with the command dbt run, which materializes fashions as views or tables in your focused Amazon Redshift cluster. You may do this out within the dbt CLI and Amazon Redshift workshop.
Handle widespread information mappings
Though you should use macros to handle information mappings (for instance, mapping “1” to “One” and “2” to “Two”), another is to keep up information mappings in information and handle the information in dbt.
The next is an instance of how dbt manages widespread information mappings. On this instance, a mannequin applies one-to-one information mappings on a subject. As an alternative of making a macro for the one-to-one information mappings, dbt lets you create a seed for the one-to-one information mappings within the type of a CSV file after which reference it later.
You may create or replace a seed with a two-step course of. After you create or replace a CSV seed file, run the command dbt seed to create the CSV seed as a desk in your focused Amazon Redshift cluster earlier than referencing it.
Handle information lineage documentation
After you may have created fashions and seeds in dbt, and used dbt’s referencing functionality, dbt gives you with a way to generate documentation in your information transformations.
You may run the command dbt docs generate adopted by dbt docs serve to launch a regionally hosted web site containing documentation in your dbt challenge. While you select a mannequin on the regionally hosted web site, details about the mannequin is displayed, together with columns within the closing view or desk, dependencies to create the mannequin, and the SQL that’s compiled to create the view or desk. The next screenshot reveals an instance of this documentation.
You can even visualize dependencies for improved navigation of documentations throughout influence evaluation. Within the following instance graph, we will see that mannequin
rpt_tech_all_users is constructed referencing the mannequin
base_public_users, which in flip references the desk
customers within the
This submit lined how you should use dbt to handle information transformations in Amazon Redshift. As you discover dbt, you’ll come throughout different options like hooks, which you should use to handle administrative duties, for instance, steady granting of privileges.
For a hands-on expertise with dbt CLI and Amazon Redshift, we have now a workshop with step-by-step directions that can assist you create your first dbt challenge and discover the options talked about on this submit—fashions, macros, seeds, and hooks. Go to dbt CLI and Amazon Redshift to get began.
In case you have any questions or recommendations, depart your suggestions within the feedback part. Should you want any additional help to optimize your Amazon Redshift implementation, contact your AWS account staff or a trusted AWS companion.
In regards to the authors
Randy Chng is an Analytics Acceleration Lab Options Architect at Amazon Internet Providers. He works with clients to speed up their Amazon Redshift journey by delivering proof of ideas on key enterprise issues.
Sean Beath is an Analytics Acceleration Lab Options Architect at Amazon Internet Providers. He delivers proof of ideas with clients on Amazon Redshift, serving to clients drive analytics worth on AWS.