Public Cloud Data Streaming Comparison: Alibaba Cloud, AWS, Azure, Google Cloud, IBM Cloud
Streaming data is becoming the next wave in the data analytics and machine learning landscape. The key reason is that processing large volumes of data is no longer sufficient on its own: a business must also be able to process that data quickly and draw real-time insights from it, so that it can react to a changing environment as it happens.
The shift to cloud computing requires stream processing engines to be highly scalable and fault tolerant. Cloud-based data stream processing systems, in particular, are built to scale dynamically to hundreds of computing nodes and to cope with diverse workloads automatically.
Recognizing the importance of data streaming across a growing variety of use cases, organizations are adopting hybrid platforms so that they can leverage the advantages of both batch and streaming data analytics.
To help enterprises choose the best data streaming service, we have compiled a list of the most feature-rich tools for you and your business.
Alibaba Cloud DataHub is a real-time data distribution platform designed to process streaming data. It lets you publish, subscribe to, and distribute streaming data, which makes it easier to analyze that data and build applications on top of it.
Based on Alibaba Cloud’s Apsara platform, DataHub delivers high availability, low latency, high scalability, and high throughput. Seamlessly connected to Alibaba Cloud’s stream computing engine, StreamCompute, DataHub allows you to use SQL to analyze streaming data. It can also distribute streaming data to various cloud products, such as MaxCompute (formerly known as ODPS) and OSS.
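The publish, subscribe, and distribute model described above can be illustrated with a small in-memory sketch. It is not the DataHub SDK: `ToyTopic`, the `page_views` topic name, and the two list-based sinks are hypothetical stand-ins for a topic and for downstream products such as MaxCompute and OSS.

```python
from typing import Callable, List

Record = dict
Subscriber = Callable[[Record], None]


class ToyTopic:
    """In-memory stand-in for a streaming topic (hypothetical, not the DataHub SDK)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        # Register a consumer that wants to receive every published record.
        self._subscribers.append(callback)

    def publish(self, record: Record) -> None:
        # Distribute: each subscriber receives its own copy of the record.
        for callback in self._subscribers:
            callback(dict(record))


# Hypothetical sinks standing in for a warehouse and an object store.
warehouse_rows: List[Record] = []
archive_rows: List[Record] = []

topic = ToyTopic("page_views")
topic.subscribe(warehouse_rows.append)
topic.subscribe(archive_rows.append)

topic.publish({"user": "u1", "page": "/home"})
topic.publish({"user": "u2", "page": "/cart"})
```

After the two `publish` calls, both sinks hold the same two records, which is the fan-out behavior that "distribute" refers to.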
See the architecture of Alibaba's big data demo system.
Source: Alibaba Cloud
In the figure, the architecture comprises a data source system, a data warehouse, a big data platform, a web/app platform, process scheduling, data processing and a real-time data streaming platform. Here, real-time data is processed through DataHub + StreamCompute.
This produces a variety of processing results in real time, including real-time charts, statistics, and other information. Overall, Alibaba’s DataHub is great if you want to stream complex data.
|Concepts||Alibaba Cloud DataHub|
|Data Retention||Default – 24 hours|
|SDK Support||MaxCompute Tunnel SDK|
Read reviews of Alibaba Cloud.
AWS Kinesis processes data in real time. Its key built-in feature is the capacity to process hundreds of terabytes of streaming data per hour. It can simplify the development of applications that make real-time decisions on business operations using streaming data.
It consists of key concepts for stream storage and an API for implementing data producers and data consumers. A data producer sends records to the stream as they are generated, and a data consumer retrieves records from the stream as they arrive.
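As a rough illustration of the producer and consumer roles, here is a toy in-memory stream. It does not call AWS. `ToyStream` and its methods are hypothetical names. Kinesis documents that partition keys are mapped to shards through an MD5 hash; the even split of the hash range across shards below is a simplifying assumption of this sketch.

```python
import hashlib
from typing import List, Tuple

StoredRecord = Tuple[str, bytes]


class ToyStream:
    """Toy model of a sharded stream (hypothetical, not the AWS SDK)."""

    def __init__(self, shard_count: int) -> None:
        self.shards: List[List[StoredRecord]] = [[] for _ in range(shard_count)]

    def _shard_for(self, partition_key: str) -> int:
        # Hash the key to a 128-bit integer, then scale it into a shard index.
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        hash_value = int.from_bytes(digest, "big")
        return hash_value * len(self.shards) // 2**128

    def put_record(self, partition_key: str, data: bytes) -> int:
        # Producer side: append the record to the shard that owns the key.
        shard_id = self._shard_for(partition_key)
        self.shards[shard_id].append((partition_key, data))
        return shard_id

    def get_records(self, shard_id: int, position: int) -> Tuple[List[StoredRecord], int]:
        # Consumer side: read everything from `position` and return the new position.
        records = self.shards[shard_id][position:]
        return records, position + len(records)


stream = ToyStream(shard_count=4)

# Producer: send readings as they are generated.
first = stream.put_record("sensor-1", b"21.5")
second = stream.put_record("sensor-1", b"21.7")
stream.put_record("sensor-2", b"19.0")

# Consumer: poll every shard from the beginning.
consumed: List[StoredRecord] = []
for shard_id in range(len(stream.shards)):
    records, _next_position = stream.get_records(shard_id, 0)
    consumed.extend(records)
```

Records that share a partition key always land in the same shard, so their relative order is preserved for the consumer.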
AWS charges per hour for each stream partition (shard) and per volume of data that flows through the stream.
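The two-part pricing model can be written as a simple formula. The function name, parameter names, and the rates below are hypothetical placeholders, not actual AWS prices; check the current pricing page before estimating real costs.

```python
def estimate_stream_cost(
    shards: int,
    hours: float,
    data_units: float,
    rate_per_shard_hour: float,
    rate_per_data_unit: float,
) -> float:
    """Cost = hourly charge per shard + charge per unit of data ingested."""
    shard_cost = shards * hours * rate_per_shard_hour
    data_cost = data_units * rate_per_data_unit
    return shard_cost + data_cost


# Placeholder rates for illustration only (not real prices).
daily_cost = estimate_stream_cost(
    shards=4,
    hours=24,
    data_units=10,
    rate_per_shard_hour=0.02,
    rate_per_data_unit=0.05,
)
```

Because the shard term grows with provisioned capacity and the data term grows with traffic, an over-provisioned but idle stream still incurs the hourly charge.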
See the diagram below summarizing key concepts of Amazon Kinesis.
When it comes to features, Amazon Kinesis supports Android, Java, Go and .NET. When it comes to performance, it writes each message synchronously to three different machines. However, the only settings it exposes for configuration are the retention period (in days) and the number of shards.
|Data Warehouse||Athena, Redshift|
|Data Retention||Default – 24 hours, 1-7 days (maximum 7 days)|
|SDK Support||AWS SDK supports Android, Java, Go, .NET|
|Real-time Store||Amazon DynamoDB|
|Cost||Pay and use|
Read reviews of AWS Kinesis data streams.
Azure Stream Analytics is a fully managed event processing engine for real-time analytics on one or more data streams from sources such as social media, sensors, web data sources, and other applications. It delivers low latency, high throughput, and high scalability.
Stream Analytics is designed on a pull-based communication model that offers built-in recovery and checkpointing abilities. The service can also protect data from downstream failure. It supports two input types, stream data and reference data, and two source types, Azure Event Hubs and Azure Blob Storage.
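To make the idea of pull-based reading with checkpointing and recovery concrete, here is a generic sketch. It does not reflect how Azure implements the feature internally. `CheckpointedReader`, `checkpoint_every`, and the simulated crash are all assumptions made for illustration.

```python
from typing import List, Optional


class CheckpointedReader:
    """Pulls events by offset and periodically saves progress (hypothetical sketch)."""

    def __init__(self, source: List[int], checkpoint_every: int = 3) -> None:
        self.source = source
        self.checkpoint_every = checkpoint_every
        # "Durable" state: the next offset to read and the running total so far.
        self.checkpoint = 0
        self.saved_total = 0

    def run(self, fail_at: Optional[int] = None) -> int:
        # Always resume from the last checkpoint, never from in-memory progress.
        offset = self.checkpoint
        total = self.saved_total
        while offset < len(self.source):
            if fail_at is not None and offset == fail_at:
                raise RuntimeError("simulated crash")
            total += self.source[offset]  # pull the next event and process it
            offset += 1
            if offset % self.checkpoint_every == 0:
                self.checkpoint, self.saved_total = offset, total
        self.checkpoint, self.saved_total = offset, total
        return total


events = list(range(1, 11))
reader = CheckpointedReader(events)

try:
    reader.run(fail_at=7)  # crash before the event at offset 7 is processed
except RuntimeError:
    pass

checkpoint_after_crash = reader.checkpoint
recovered_total = reader.run()  # restart: replay from the last checkpoint
```

Work done after the last checkpoint is discarded on failure and replayed on restart, so the final total matches a run with no crash at all.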
The diagram summarizes how data is received, analyzed and sent for other actions in Stream Analytics.
Event Hubs can ingest millions of events per second in various formats and feed them to Stream Analytics. Blob Storage can also store data and direct it to Stream Analytics for processing. Currently, Stream Analytics is charged on the basis of the volume of data processed and the number of streaming units used.
|Concepts||Azure Stream Analytics|
|Data Warehouse||Azure SQL|
|SDK Support||Management .NET SDK|
|Real-time Store||Azure CosmosDB|
Read reviews of Azure Stream Analytics.
Cloud Dataflow is a managed data processing service that uses data pipelines to ingest, transform, and analyze both real-time and batch data. Based on Apache Beam, the service supports Python and Java jobs.
In Dataflow, events pass through three steps: validation, enrichment, and ingestion. This service streams, processes, and stores over 120,000 events per second with very low latency. Every incoming event is validated and written to partitioned tables in BigQuery.
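The three steps can be sketched as chained generators. This is plain Python, not the Apache Beam SDK. The field names (`user_id`, `ts`, `event_date`), partitioning by date, and the dictionary standing in for partitioned BigQuery tables are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator, List

Event = dict


def validate(events: Iterable[Event]) -> Iterator[Event]:
    # Step 1: drop events that lack a string user id or a numeric timestamp.
    for event in events:
        if isinstance(event.get("user_id"), str) and isinstance(event.get("ts"), (int, float)):
            yield event


def enrich(events: Iterable[Event]) -> Iterator[Event]:
    # Step 2: derive the UTC date that will serve as the partition key.
    for event in events:
        day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
        yield {**event, "event_date": day}


def ingest(events: Iterable[Event], tables: Dict[str, List[Event]]) -> int:
    # Step 3: append each event to the partition for its date.
    count = 0
    for event in events:
        tables[event["event_date"]].append(event)
        count += 1
    return count


raw_events = [
    {"user_id": "u1", "ts": 0},
    {"user_id": None, "ts": 10},      # rejected by validation
    {"user_id": "u2", "ts": 86400},
]

partitions: Dict[str, List[Event]] = defaultdict(list)
written = ingest(enrich(validate(raw_events)), partitions)
```

Because each step is a generator, events flow through one at a time, which loosely mirrors how a streaming pipeline processes records as they arrive.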
See the Dataflow stream and batch processing flow below.
Google Cloud Dataflow is a great choice for organizations that want to run production-level data processing in the cloud. Users are charged in per-second increments based on actual use of the service. Any additional Google Cloud resource consumption is billed according to that service's pricing.
|SDK Support||Apache Beam SDK|
|Real-time Store||Cloud Bigtable|
|Cost||Based on the actual use of Dataflow batch or streaming workers|
Read reviews of Google Cloud Dataflow.
IBM Streaming Analytics can manage high data rates and perform analysis with low latency. It can be used to ingest, analyze and monitor data coming from real-time data sources. With IBM Streams, companies can view information and events as they unfold.
The image below summarizes the architecture of IBM Streaming Analytics.
The architecture offers a dynamic approach to resource allocation: organizations define the maximum number of nodes to use in their environment, and the service scales up or down accordingly. This ensures that a company pays only for the resources it uses while monitoring, managing, and making informed decisions with little effort.
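The scaling rule described above amounts to clamping the required node count to a configured ceiling. The sketch below is a generic illustration, not IBM's algorithm. `target_nodes`, `per_node_capacity`, and the default minimum of one node are assumptions.

```python
import math


def target_nodes(
    events_per_second: float,
    per_node_capacity: float,
    max_nodes: int,
    min_nodes: int = 1,
) -> int:
    """Return how many nodes to run, never exceeding the configured maximum."""
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    if max_nodes < min_nodes:
        raise ValueError("max_nodes must be at least min_nodes")
    # Nodes needed to absorb the current load, rounded up.
    needed = math.ceil(events_per_second / per_node_capacity)
    # Clamp to the range the organization configured.
    return max(min_nodes, min(needed, max_nodes))


idle = target_nodes(0, per_node_capacity=1000, max_nodes=5)
normal = target_nodes(2500, per_node_capacity=1000, max_nodes=5)
peak = target_nodes(9000, per_node_capacity=1000, max_nodes=5)
```

With these inputs the service would run one node when idle, three under normal load, and stay capped at five during the peak, so cost follows usage up to the limit the organization set.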
|Concepts||IBM Streaming Analytics|
|Data Warehouse||IBM Db2 Warehouse|
|SDK Support||Eclipse SDK|
|Real-time Store||IBM Cloud Object Storage|
|Cost||Based on instance per hour|
Read reviews of IBM Streaming Analytics.
The time is NOW!
Streaming data architecture is constantly evolving. So, before rushing to pick any of these solutions, it is important to develop a deep understanding of your existing systems and a clear picture of how they work. Bear in mind that each of these services is strong in its own way.
The question, however, is which one is right for you. To answer it, go through the features of each and see which best suits your use case and available resources.
Brief comparison: Alibaba Cloud vs AWS vs Azure vs Google Cloud vs IBM Cloud
|Concepts||Alibaba Cloud||AWS||Azure||Google Cloud||IBM Cloud|
|Data Warehouse||MaxCompute||Athena, Redshift||Azure SQL||BigQuery||IBM Db2 Warehouse|
|Data Retention||Default – 24 hours||Default – 24 hours, 1-7 days (maximum 7 days)||–||–||–|
|SDK Support||MaxCompute Tunnel SDK||AWS SDK supports Android, Java, Go, .NET||Management .NET SDK||Apache Beam SDK||Eclipse SDK|
|Real-time Store||ApsaraDB||Amazon DynamoDB||Azure CosmosDB||Cloud Bigtable||IBM Cloud Object Storage|
|Cost||Pay-As-You-Go||Pay and use||Pay-As-You-Go||Based on the actual use of Dataflow batch or streaming workers||Based on instance per hour|