Updated March 8, 2023
Introduction to Redshift Copy
The Redshift COPY command transfers data from external sources into tables in an Amazon Redshift data warehouse. Amazon Redshift is a data warehouse service that is part of Amazon Web Services (AWS), a cloud computing platform.
The external sources can be any of the following:
- An Amazon S3 data lake that stores cold legacy data in structured and unstructured formats.
- An Amazon DynamoDB table; DynamoDB is a NoSQL database that supports key-value and document data structures.
- An Amazon EMR (Elastic MapReduce) cluster, a big data platform that processes data using open-source frameworks such as Spark, Hive, and Presto.
- A remote host reached over a Secure Shell (SSH) connection.
Three Topics in Redshift Copy
The three main topics are:
- Mandatory Parameters
- Optional Parameters
- Other General Information
1. Mandatory Parameters
These are the minimum parameters required to run a COPY command in Amazon Redshift:
- Table name to which the data has to be transferred.
- Data source from where the data is copied.
- Authentication to ensure the needed authorizations to access the data sources while copying.
A typical copy command with mandatory parameters looks like this:
copy Dest_Table_name from Input_Data_Source Authentication
- Dest_Table_name: This table must already exist in the Redshift data warehouse; if it does not, it has to be created with a CREATE TABLE command that defines the columns and their data types. The table may be a regular permanent table or a temporary one. If the table already contains data, the COPY command appends the new rows to the existing data.
- Input_Data_Source: The source from which the data is copied must be stated clearly in the COPY command, and it should be one of the four sources mentioned above. The source name, the path within the source system, and the exact file name with its extension must all be specified for the COPY command to execute successfully.
- Authentication: The COPY command requires authorization to read data from the source, and the authorization method must be clearly stated in the command. Authorization can be role based, referencing an AWS Identity and Access Management (IAM) role attached to the cluster the user operates, or key based, using the credentials of an individual IAM user. In role-based authorization, the IAM role grants permission to access the files in the source; in key-based authorization, access is granted through the IAM user's credentials. Both the IAM role and the IAM user carry secured credentials for gaining access. For flexibility and security, role-based access is recommended.
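Putting the three mandatory parameters together, a minimal end-to-end load might look like the following sketch. The table definition, S3 path, and role ARN are illustrative placeholders, not values from a real account:

```sql
-- COPY does not create tables, so create the destination table first.
create table supplier (
    supp_id   integer,
    supp_name varchar(50),
    city      varchar(30)
);

-- Role-based authorization: the IAM role attached to the cluster
-- must have read access to the S3 bucket.
copy supplier
from 's3://myfolder/supplier'
iam_role 'arn:aws:iam::123456789012:role/myrole';
```

If supplier already held rows, this command would append the loaded data to them rather than replacing them.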
2. Optional Parameters
A typical copy command with optional parameters looks like this:
copy Target_table [column_list] from input_source authentication
[format ...] [parameter [arg1] ...]
If the structure of the source data is identical to that of the destination table in the Redshift data warehouse, a simple copy command from the source to the destination is sufficient to complete the operation. That is not always the case: the source file may be plain text, character delimited, fixed width, or in another format entirely.
In many cases, the data description (columns, sizes, data types) of the source does not match that of the destination, and users want to transform and enrich the data to suit the application's requirements while copying.
Optional parameters play a critical role in managing the differences between the input and output tables. In the COPY command, users can specify the input-to-output column mapping, the format of the source data and its corresponding format in the destination, how the data should be parsed while copying, the transformations to apply during conversion, and how the data should be loaded.
- Column Mapping: By default, the COPY command loads fields from the source into the destination table's columns in order. If the orders differ, a column list mapping source fields to destination columns has to be supplied to the COPY command. For JSON source data, a JSONPaths file can also be used to map fields to columns.
- Data Format: The source can be a text file, a fixed-width or character-delimited file, CSV, Avro, JSON, or one of several other formats. The format has to be specified in the COPY command.
- Data Transformation: Data on the source side can be converted into a different data type on the destination side, and the conversion details must be clearly specified in the COPY command. For example, text data on the input side can be converted into a date field in the output; similarly, text data in the source can be converted into a timestamp format in the target table.
- Data Loading Pattern: To optimize loading speed and data retrieval time, the default loading pattern can be tuned with options supplied in the COPY command.
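As a sketch of how these optional parameters combine, the command below loads a CSV file with a header row, maps only three columns, and parses a non-default date format. The bucket, role ARN, and column names are illustrative placeholders:

```sql
-- Column list maps CSV fields to three destination columns;
-- ignoreheader skips the header row; dateformat parses text into dates.
copy supplier (supp_id, supp_name, joined_date)
from 's3://myfolder/supplier.csv'
iam_role 'arn:aws:iam::123456789012:role/myrole'
format as csv
ignoreheader 1
dateformat 'DD-MM-YYYY';
```

Here the DATEFORMAT option is an example of data transformation: the text value in the file is converted into a date field in the target table.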
3. Other General Information
- It deals with copying specific kinds of files, such as favourite movies, listings from various sources, event files, timestamp information, venue data, and so on. This section handholds users through troubleshooting data-loading issues.
- Usage notes support end users on copying from columnar data, copying from JSON format, the permissions needed to access other Amazon resources, copying with access aliases in S3, loading GEOMETRY and HLLSKETCH data types, and managing date/time format strings.
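As an illustration of one of these usage notes, JSON data can be loaded with the 'auto' option, which matches JSON field names to table column names automatically. The bucket and role ARN below are illustrative placeholders:

```sql
-- 'auto' matches JSON object keys to column names in the mymovie table;
-- a JSONPaths file could be supplied instead for explicit mapping.
copy mymovie
from 's3://myfolder/movies.json'
iam_role 'arn:aws:iam::123456789012:role/myrole'
format as json 'auto';
```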
Examples of Copy Command
Data is most commonly copied from Amazon S3, but it can also be sourced from EMR, DynamoDB, and a remote host.
a. Copy from Amazon S3
copy supplier from 's3://myfolder/supplier'
iam_role 'arn:aws:iam::202837352:role/myrole'
[optional parameters];
b. Copy from Amazon EMR
copy supplier from 'emr://j-mysupplier/supplier-*'
iam_role 'arn:aws:iam::202837352:role/myrole'
[optional parameters];
c. Copy from Amazon DynamoDB
copy mymovie from 'dynamodb://Moviestation'
iam_role 'arn:aws:iam::202837352:role/myrole'
readratio 50
[optional parameters];
d. Copy from Remote Host
Copying from a remote host reads data over SSH; here the FROM clause points to an SSH manifest file in S3 that lists the hosts to connect to and the commands to run on them.
copy supplier from 's3://myfolder/ssh_manifest'
iam_role 'arn:aws:iam::202837352:role/myrole'
ssh [optional parameters];
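A minimal sketch of such an SSH manifest follows; the endpoint, username, and file path are hypothetical placeholders:

```json
{
  "entries": [
    {
      "endpoint": "ec2-12-34-56-78.compute-1.amazonaws.com",
      "command": "cat /home/ec2-user/supplier.txt",
      "mandatory": true,
      "username": "ec2-user"
    }
  ]
}
```

Each entry names a host, the command whose output supplies the data, and the user to connect as; "mandatory": true makes the load fail if that host cannot be reached.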
Conclusion
Flexibility and variety offered by Redshift copy enable users to seamlessly transfer data from several Amazon sources into a central data warehouse for further processing. Availability of data in a centralized location helps developers to build new business cases and meet different requirements of users.
Recommended Articles
This is a guide to Redshift Copy. Here we discuss the introduction and the three main topics in Redshift Copy for better understanding.