Moving Data From Microsoft SQL Server To Amazon S3

In this guide, we’ll look at moving data from Microsoft SQL Server, one of the most popular OLTP database systems, to Amazon S3, a popular object storage service.

Heading 2

Let's Get Decoding

Start free Talk To An Expert

Data movement is the beating heart of modern data infrastructure. It facilitates seamless information flow across systems to power customer interactions, AI, analytics, and so much more. Powered by Flink and Debezium, Decodable handles both ETL and ELT workflows, eliminating the need for teams to compromise between data processing and movement. Beyond providing a core data infrastructure platform, Decodable addresses the challenge of data movement and stream processing pragmatically, reimagining it in a way that is unified, real-time, and cost efficient. By simplifying the most formidable data infrastructure challenge, Decodable enables teams to focus on their core strengths: innovation and delivering value.

One of the most common data movement use cases is replicating a database to durable, highly-available storage service to support a wide range of applications. In this guide, we’ll look at moving data from Microsoft SQL Server to Amazon S3. SQL Server is one of the most popular online transaction processing (OLTP) database systems. OLTP workloads are optimized for database inserts and storage, with a wide variety of applications including banking, online shopping, and system logging. Similarly, S3 is among the most popular object storage services. Storage systems are optimized for the retention of massive amounts of unstructured data, and are typically used as data lakes and data warehouses to support business applications, photo and video archiving, websites, and more.

SQL Server Overview

Microsoft SQL Server is a relational database management system (RDBMS) that supports a wide variety of transaction processing, business intelligence and analytics applications in corporate IT environments. The latest version integrates with Azure Synapse Link and Microsoft Purview to enable customers to drive deeper insights, predictions, and governance from their data at scale. Cloud integration is enhanced with managed disaster recovery (DR) to Azure SQL Managed Instance, along with near real-time analytics, allowing database administrators to manage their data estates with greater flexibility and minimal impact to the end-user. Performance and scalability are automatically enhanced via built-in query intelligence.

The traditional approach to syncing SQL Server with complementary data stores is batch-based, where pipelines extract data from SQL Server and send it to downstream data stores. Change data capture (CDC) is a modern alternative to inefficient bulk imports. CDC extracts change events (INSERTs, UPDATEs, and DELETEs) in real-time from data stores, such as SQL Server, and provides them to downstream systems which can be applied to a copy of the data.

Amazon S3 Overview

S3 is a fully managed software-as-a-service (SaaS) that provides a single platform for data warehousing, data lakes, data engineering, data science, data application development, and secure sharing and consumption of real-time/shared data. S3 features out-of-the-box features like separation of storage and compute, on-the-fly scalable compute, data sharing, data cloning, and third-party tools support in order to handle the demanding needs of growing enterprises.

The S3 data platform is not built on any existing database technology or “big data” software platforms such as Hadoop. Instead, S3 combines a completely new SQL query engine with an innovative architecture natively designed for the cloud. To the user, S3 provides all the functionality of an enterprise analytic database, along with many additional special features and unique capabilities. S3 runs completely on cloud infrastructure. All components of S3’s service (other than optional command line clients, drivers, and connectors), run in public cloud infrastructures. S3 uses virtual compute instances for its compute needs and a storage service for persistent storage of data. S3 cannot be run on private cloud infrastructures (on-premises or hosted) as it is not a packaged software offering that can be installed by a user. S3 manages all aspects of software installation and updates.

Prerequisites

Using SQL Server with Decodable

Sending a data stream from SQL Server CDC to Decodable is accomplished in two stages, first by populating an AWS Kinesis stream with SQL Server CDC data, and then by adding and configuring a Kinesis source connector in Decodable. To capture a data change stream from SQL Server, a Debezium server is created with an SQL Server CDC connector, which populates an AWS Kinesis stream.

Set up SQL Server
Enable CDC in MSSQL
Create a Kinesis Stream in AWS
Download Debezium server
Download SQL Server JDBC Driver
Edit the conf/application.properties

<div class="side-note">For more detailed information, please refer to the SQL Server CDC example in the Decodable GitHub repository.</div>

Using Amazon S3 with Decodable

Before you send data from Decodable into S3, you must have an Identity and Access Management (IAM) role with the following policies (see the Setting up an IAM User section for more information):

A Trust Policy that allows access from Decodable’s AWS account. The <span class="inline-code">ExternalId</span> must match your Decodable account name.
A Permissions Policy with read and write permissions for the destination bucket.

Create Connectors

Follow the steps in the sections below to get data from SQL Server into Decodable, optionally transform it, and then from Decodable to Amazon S3. These steps assume that you are using the Decodable web interface. However, if you want to use the Decodable CLI to create the connection, you can refer to the Decodable documentation for SQL Server and Amazon S3 for information about what the required property names are.

Create an Amazon Kinesis Source Connector for SQL Server

Once a Kinesis stream has been populated with SQL Server CDC data, we can add and configure a Kinesis source connector in Decodable. Before you can get data from Amazon Kinesis, the following requirements must be met:

You must have an Identity and Access Management (IAM) user with the necessary permissions for reading from the Kinesis stream. See AWS permissions for getting data from Amazon Kinesis Streams.
You must also have the access key ID and secret access key for that IAM user.

Follow these steps to get data from Kinesis into Decodable.

From the Connections page, select Amazon Kinesis and complete the required fields.
Select the stream that you’d like to connect to this connector. Then, select Next.
Define the connection’s schema. Select New Schema to manually enter the fields and field types present or Import Schema if you want to paste the schema in the form of an Avro or JSON array. The stream's schema must match the schema of the data that you plan on sending through this connection.
Select Next when you are finished providing defining the connection’s schema.
Give the newly created connection a Name and Description and select Save.
Start your connection to begin processing data from SQL Server.

Create an Amazon S3 Sink Connector

From the Connections page, select the Amazon S3 connector and complete the required fields.
Select which stream contains the records that you’d like to send to Amazon S3. Then, select Next.
Give the newly created connection a Name and Description and select Save.
If you are replacing an existing Amazon S3 connection, then restart any pipelines that were processing data for the previous connection.
Finally, Start your connection to begin ingesting data.

Additional information is available in the Decodable documentation to help with:

At this point, you have data streaming in real-time from PostgreSQL to S3!

Processing Data In Real-time With Pipelines

A pipeline is a set of data processing instructions written in SQL or expressed as an Apache Flink job. Pipelines can perform a range of processing including simple filtering and column selection, joins for data enrichment, time-based aggregations, and even pattern detection. When you create a pipeline, you define what data to process, how to process it, and where to send that data to in either a SQL query or a JVM-based programming language of your choosing such as Java or Scala. Any data transformation that the Decodable platform performs happens in a pipeline. To configure Decodable to transform streaming data, you can insert a pipeline between streams. As we saw when creating an S3 connector above, pipelines aren’t required simply to move or replicate data in real-time.

Create a Pipeline Between the SQL Server and S3 Streams

As an example, you can use a SQL query to cleanse the MSSQL data to meet specific compliance requirements or other business needs when it lands in S3. Perform the following steps:

Create a new Pipeline.
Select the stream from SQL Server as the input stream and click Next.
Write a SQL statement to transform the data. Use the form: <span class="inline-code">insert into <output> select … from <input></span>. Click Next.

Decodable will create a new stream for the cleansed data. Click Create and Next to proceed.

Provide a name and description for your pipeline and click Next.
Start the pipeline to begin processing data.

The new output stream from the pipeline can be written to S3 instead of the original stream from SQL Server. You’re now streaming transformed data into S3 from SQL Server in real-time.

Conclusion

Replicating data from OLTP databases like SQL Server to S3 in real-time helps you to manage storage costs, meet regulatory requirements, reduce latency, and save multiple distinct copies of your data for compliance requirements. It’s equally simple to cleanse data in flight so it’s useful as soon as it lands. In addition to reducing latency to data availability, this frees up data warehouse resources to focus on critical analytics, ML, and AI use cases.