Updated March 8, 2023
What is Redshift Spectrum?
Redshift Spectrum is a feature of Amazon Redshift, the data warehouse service from Amazon Web Services (AWS). It offers a common platform to extract and view data from both a hot data store and a cold data store (legacy data) without switching between different software tools. Hot, live data is stored in the costlier Redshift data warehouse, whereas cold legacy data that is accessed only occasionally is stored in a cheaper data lake such as an Amazon S3 bucket.
Redshift Spectrum helps organizations economize on storage cost by moving infrequently accessed data away from the main storage, while still retrieving that archived data reasonably fast and without much effort.
Redshift Spectrum vs Athena
Redshift Spectrum is a logical extension of Redshift that queries data in Redshift as well as in Amazon S3 data lakes, whereas Athena is a standalone tool built to query data in Amazon S3.
In the case of Athena, resource allocation and deallocation are taken care of by Amazon Web Services, while cluster provisioning for Spectrum is handled by end-users. This gives users control over the cost of Redshift Spectrum.
Performance in Athena depends on the load in the system, as it works on resources shared with other users, whereas performance in Spectrum is more consistent because it is driven by the user's own dedicated Redshift cluster.
Performance in Spectrum can be enhanced by augmenting or optimizing both the Redshift cluster resources and the S3 storage, but the performance of Athena depends on how the data is stored in S3 only.
Large IT installations can opt for Spectrum, whereas a small setup can go for the simpler Athena.
Users have to manage clusters with Spectrum, whereas Athena is fully serverless.
Athena charges users based on the queries made and the amount of data scanned. Spectrum cost is also based on a pay-per-use model. In both cases, users can save money by compressing the data, storing it in a columnar format, and partitioning the data.
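The effect of these three savings on a pay-per-scan bill can be sketched in a few lines of Python. This is only a rough illustration: the price per terabyte, the compression ratio, and the fractions of columns and partitions read are all assumed values chosen for the example, not figures quoted by AWS, and the function names are hypothetical.

```python
# Rough estimate of a pay-per-data-scanned query bill.
# PRICE_PER_TB_USD is an assumed illustrative rate, not a quoted price.
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4


def estimate_bytes_scanned(raw_bytes, compression_ratio=1.0,
                           column_fraction=1.0, partition_fraction=1.0):
    """Estimate the bytes a query reads after the three optimizations.

    compression_ratio:  raw size divided by compressed size (assumed, e.g. 4.0)
    column_fraction:    share of the columnar file's columns the query reads
    partition_fraction: share of the partitions the query touches
    """
    return raw_bytes / compression_ratio * column_fraction * partition_fraction


def estimate_scan_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Return the estimated cost in USD for scanning the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# 1 TB of raw data, queried without any optimization.
unoptimized = estimate_scan_cost(estimate_bytes_scanned(BYTES_PER_TB))

# The same data compressed 4:1, reading a quarter of the columns
# and a tenth of the partitions (all assumed numbers).
optimized = estimate_scan_cost(
    estimate_bytes_scanned(BYTES_PER_TB, compression_ratio=4.0,
                           column_fraction=0.25, partition_fraction=0.1)
)

print(f"unoptimized: ${unoptimized:.4f}, optimized: ${optimized:.4f}")
```

With these assumed numbers, the optimized query scans well under 1% of the raw data, and the bill shrinks in the same proportion.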
Amazon Redshift Customers
Amazon Redshift has a strong presence in the data analytics space, with customers ranging from top 500 companies to medium-sized companies and start-ups. It helps its customers gain insight into their data and get the most benefit from it.
Zynga, Coca-Cola, Wynk Music, Pizza Hut, Redhill Games, KFC, BigBasket, Nasdaq, and Toyota are some of the big names in the Redshift customer list.
Three Key Concepts of Amazon Redshift Spectrum
How Spectrum Works
Traditionally, live data is stored in a structured database and queried as and when needed through a query language. Redshift offers the warehouse to store such data and the facility to query it. Data that is not in frequent use is pushed into cold storage, and Amazon offers S3 (Simple Storage Service) to store such old data. Amazon S3 holds trillions of objects, some of them terabytes in size.
Amazon’s Athena, Elastic MapReduce, and Redshift can each be used to extract data from S3 and present the results to end-users.
1. Athena provides a direct query facility over the legacy data stored in S3. There are several interfaces (APIs) to facilitate these queries. Users don’t have to manage any servers.
2. Elastic MapReduce runs Hadoop-style jobs to process big data stored in an unstructured manner.
3. Redshift extracts the data from S3 and loads it into the Redshift cluster for further processing (through the ETL method).
Each of the above solutions involves considerable cost and effort. Redshift Spectrum offers a simpler way to handle hot and legacy data at a low cost, allowing users to enjoy the best of both worlds:
1. Redshift queries built for live data in Redshift can continue to be used as they are.
2. It directly queries the cold legacy data in S3 without any ETL processing.
3. The results from S3 and the live data are joined and presented to the users.
4. A single query is sufficient to extract data from both the Redshift warehouse and the S3 data lake.
5. Users don’t have to pay for any extra compute resources to process S3 data; they pay only for the amount of data queried from S3.
Three Key Concepts
The whole Redshift Spectrum deployment works in Virtual Private Cloud (VPC) mode, and compute resources are made available to run queries and return results. These resources are provided by Spectrum and are independent of the Redshift cluster itself.
The three key concepts are:
a. Data Catalog for S3
b. Schema for S3 Data
c. S3 Data Tables
The data catalog holds the schema definitions of the organization's data stored in S3 data lakes, and it is the central repository of metadata about those data assets. AWS Glue, the Athena data catalog, and a Hive metastore (Amazon EMR) are the options available for the catalog.
The schema holds information on the data tables and other database objects, such as views and functions, of the S3 data store.
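In practice, the catalog and the schema are tied together by creating an external schema in Redshift that points at a catalog database. The sketch below builds such a statement as a Python string. The schema name, catalog database name, account ID, and role name are all made-up placeholders, and the helper function is hypothetical; the statement itself would still have to be run on the cluster through a SQL client.

```python
def external_schema_sql(schema_name: str, catalog_database: str, iam_role_arn: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA statement backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema_name}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


# All names below are placeholders chosen for illustration.
statement = external_schema_sql(
    schema_name="spectrum_schema",
    catalog_database="spectrum_db",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
print(statement)
```

The IAM role is what allows the cluster to read the catalog and the underlying S3 objects, so it needs permissions on both.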
Tables hold the data and return it when queried with a SELECT statement in SQL. These tables are read-only; writes and updates are not allowed. There is no separate query language for Spectrum. The queries used in Redshift can be used as they are, with only the table reference changed to point to the S3 data store.
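As a rough illustration, an external table definition and an unchanged query might look like the following. The helper functions, table names, column names, and bucket path are hypothetical, and Parquet is assumed as the file format purely for the example.

```python
def external_table_sql(schema, table, columns, partition_columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for Parquet files in S3.

    columns and partition_columns are lists of (name, sql_type) pairs.
    """
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    parts = ", ".join(f"{name} {sql_type}" for name, sql_type in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


def totals_sql(table_reference: str) -> str:
    """The same SELECT works for internal and external tables."""
    return (
        "SELECT COUNT(*) AS row_count, SUM(amount) AS total_amount "
        f"FROM {table_reference};"
    )


# Placeholder names: an external table over archived sales data.
ddl = external_table_sql(
    schema="spectrum_schema",
    table="sales_archive",
    columns=[("sale_id", "BIGINT"), ("customer_id", "BIGINT"), ("amount", "DECIMAL(10,2)")],
    partition_columns=[("sale_date", "DATE")],
    s3_location="s3://example-bucket/sales/",
)
print(ddl)

# Only the table reference differs between the live and the archived query.
print(totals_sql("public.sales"))
print(totals_sql("spectrum_schema.sales_archive"))
```

Because the table is read-only, only SELECT statements are built here; there is no INSERT or UPDATE counterpart.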
Joining Internal and External Tables
Redshift internal tables and S3 external tables can be combined in a single SQL query, with a clear reference to the source of each table, whether it is Redshift or S3. Join conditions should be used to match the rows of the two tables.
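A minimal sketch of such a query is shown below, again built as a Python string. It assumes a hypothetical internal table public.customers and a hypothetical external table spectrum_schema.sales_archive that share a customer_id column; none of these names come from a real system.

```python
def join_sql(internal_table: str, external_table: str) -> str:
    """Build one query that joins a Redshift table with an S3 external table."""
    return (
        "SELECT c.customer_name, SUM(s.amount) AS lifetime_amount\n"
        f"FROM {internal_table} AS c\n"      # live data in the warehouse
        f"JOIN {external_table} AS s\n"      # legacy data in S3
        "  ON s.customer_id = c.customer_id\n"  # the join condition
        "GROUP BY c.customer_name;"
    )


query = join_sql("public.customers", "spectrum_schema.sales_archive")
print(query)
```

The schema prefix on each table is the only thing that tells the reader which side is the warehouse and which side is the data lake.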
Redshift Spectrum Diagram
Pictorial representation of how data is queried from the warehouse and from S3, and how the results are merged and presented to users.
Benefits of Redshift Spectrum
1. Live and legacy data can be accessed in a single query.
2. It offers a cost-effective data analytics solution.
3. Users don’t have to worry about whether the data resides in the warehouse or in the data lake.
4. It is easy to maintain and administer.
5. It offers a secure and scalable solution.
Redshift Spectrum provides an integrated business intelligence solution. Because data can be fetched easily from the archive, more and more data can be moved there when it is not required for transaction processing. The organization does not have to increase its live storage and thus saves substantial cost.