This hypothetical, high-level architecture diagram explains how we can use OSS technologies more effectively. We will discuss the whole solution under four main subject areas.
- Distributed Streaming Message Broker (Data Pipeline)
- Streaming ETL & Streaming Aggregation
- Time Series Data Storing & ML Processing
- Dashboard & Notification
Distributed Streaming Message Broker
We use Kafka as our data pipeline; you can find more details about Kafka here [1].
Kafka is a proven technology that acts as a distributed streaming log. Using its partitioning, we can scale out easily. Its streaming nature makes it ideal for event-driven architectures. Kafka's design makes disk access largely sequential, and it utilizes the OS page cache efficiently [2].
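To illustrate the partitioning idea, here is a minimal sketch of Kafka-style key-to-partition routing. This is illustrative only: real Kafka clients use murmur2 hashing internally, and the partition count and key names below are hypothetical.

```python
# Sketch of Kafka-style key -> partition routing (illustrative only).
# Real Kafka clients use murmur2 hashing; zlib.crc32 is used here just
# to show deterministic partition assignment.
import zlib

NUM_PARTITIONS = 6  # hypothetical partition count for the topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to a partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All events with the same key always land on the same partition, so
# per-key ordering is preserved while load spreads across partitions.
events = ["device-1", "device-2", "device-1", "device-3", "device-1"]
partitions = [partition_for(k) for k in events]

assert partitions[0] == partitions[2] == partitions[4]  # same key, same partition
```

Because the mapping is deterministic, adding consumers scales reads per partition without breaking per-key ordering.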
Streaming ETL & Streaming Aggregation
Flink will be the stream processing platform we use for aggregation, ETL, and CEP. Flink supports stream processing by design and is one of the most widely used streaming technologies at the moment. Flink is based on the Dataflow model, which means it processes elements as and when they arrive rather than in micro-batches (as Spark Streaming does).
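The difference can be shown with a toy simulation in plain Python (not Flink's or Spark's actual APIs): element-at-a-time processing updates per-key state on every event, while micro-batching only produces results at batch boundaries.

```python
# Toy contrast between element-at-a-time processing (Flink/Dataflow style)
# and micro-batching (Spark Streaming style). Plain Python, not a real API.
from collections import defaultdict

events = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4), ("sensor-b", 1)]

# Element-at-a-time: state is updated the moment each event arrives,
# so downstream consumers see a fresh aggregate after every event.
running = defaultdict(int)
per_event_view = []
for key, value in events:
    running[key] += value
    per_event_view.append((key, running[key]))

# Micro-batch: events are buffered and aggregated once per batch,
# so results become visible only at batch boundaries.
BATCH_SIZE = 2
batch_view = []
for i in range(0, len(events), BATCH_SIZE):
    totals = defaultdict(int)
    for key, value in events[i:i + BATCH_SIZE]:
        totals[key] += value
    batch_view.append(dict(totals))

print(per_event_view)  # a result per event: latency of one element
print(batch_view)      # a result per batch: latency of one batch interval
```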
Uber AthenaX is a SQL-based streaming analytics framework [3]. It combines YARN, Calcite, and Flink. AthenaX provides APIs to monitor, access, and administer the life cycle of Flink jobs. And thanks to Calcite, we can write SQL-based stream processing applications and run them inside AthenaX easily.
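As a rough illustration of the SQL-based approach, the sketch below runs an aggregation query against an in-memory SQLite table standing in for one stream window. This is not AthenaX or Calcite, and the table and column names are hypothetical; the point is that the analytics logic lives in a SQL statement rather than hand-written operators.

```python
# Illustrative only: an AthenaX/Calcite-style aggregation expressed as SQL,
# run here against an in-memory SQLite table standing in for a stream window.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (device TEXT, metric REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("device-1", 10.0), ("device-2", 4.0), ("device-1", 6.0)],
)

# The same kind of query, written once in SQL, could be submitted to a
# SQL-on-streams engine and run continuously as a Flink job.
rows = conn.execute(
    "SELECT device, AVG(metric) FROM events GROUP BY device ORDER BY device"
).fetchall()

print(rows)  # [('device-1', 8.0), ('device-2', 4.0)]
```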
Time Series Data Storing & ML Processing
We use Cassandra as the time-series data store. Cassandra's architecture is ideal for this purpose because it writes to disk sequentially, which enables fast reads over large data sets.
The prepared time-series data we get from the streaming analytics platform will be stored in Cassandra for machine learning.
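A common Cassandra time-series pattern partitions rows by series id plus a time bucket and clusters them by timestamp, so each partition is written and read sequentially. Here is a plain-Python sketch of that layout; the daily bucketing and field names are hypothetical choices, not a prescribed schema.

```python
# Sketch of a Cassandra-style time-series layout in plain Python:
# partition key = (series_id, day bucket), rows clustered by timestamp.
# The daily bucketing and field names are hypothetical.
from collections import defaultdict

SECONDS_PER_DAY = 86_400
table = defaultdict(list)  # partition key -> rows for that partition

def write_point(series_id: str, ts: int, value: float) -> None:
    bucket = ts // SECONDS_PER_DAY           # one partition per series per day
    table[(series_id, bucket)].append((ts, value))

def read_day(series_id: str, day_start: int) -> list:
    rows = table[(series_id, day_start // SECONDS_PER_DAY)]
    return sorted(rows)                      # clustering order: by timestamp

write_point("cpu-host1", 100, 0.42)
write_point("cpu-host1", 50, 0.31)
write_point("cpu-host1", 86_500, 0.77)       # lands in the next day's partition

print(read_day("cpu-host1", 0))  # [(50, 0.31), (100, 0.42)]
```

Reading one day of one series touches a single partition, which is what makes the large sequential reads cheap.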
We will train several ML models on the prepared data, and the trained models/knowledge base will be stored (as binaries) in Cassandra. This way we can dynamically select trained models when analyzing real-time data.
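Storing a trained model "as binary" simply means serializing it and writing the bytes into a blob column keyed by a model name/version, then deserializing it at scoring time. A minimal pickle-based sketch, where a dict stands in for the Cassandra table and the model class and keys are hypothetical:

```python
# Sketch of storing a trained model as a binary blob and loading it back
# for real-time scoring. A dict stands in for a Cassandra blob column;
# the model class and the name/version keys are hypothetical.
import pickle

class ThresholdModel:
    """Stand-in 'trained model': flags values above a learned threshold."""
    def __init__(self, threshold: float):
        self.threshold = threshold
    def predict(self, value: float) -> bool:
        return value > self.threshold

model_store = {}  # (model_name, version) -> serialized model bytes

# Training side: serialize the fitted model and store the bytes.
trained = ThresholdModel(threshold=0.8)
model_store[("anomaly-detector", "v1")] = pickle.dumps(trained)

# Scoring side: dynamically pick a model by name/version, deserialize, score.
loaded = pickle.loads(model_store[("anomaly-detector", "v1")])
print(loaded.predict(0.95), loaded.predict(0.5))  # True False
```

Keying the blob by name and version is what allows the real-time path to swap models without redeploying the scoring job.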
Each Spark worker and its Cassandra node will reside on the same host, connected via the Spark-Cassandra connector. For high availability there will be an active and a standby Spark master, with ZooKeeper keeping the state of each master.
The main advantage of this approach is guaranteed data locality between a Cassandra node and its Spark worker, which results in high-performance data fetching.
Dashboard & Notification
Finally, we need to store the analyzed data for later use (dashboards and trend analysis). For example, if we use this solution for predictive analytics, we can save the predicted data in Elasticsearch. With the ELK stack, people can easily build dashboards over the analysis data using Kibana.
Elasticsearch and Kibana also offer many other features, such as alerts, trend analysis, and search.
References:
[1] https://kafka.apache.org/
[2] https://stackoverflow.com/questions/45751641/how-does-kafka-guarantee-sequential-disk-access
[3] https://eng.uber.com/tag/athenax/
[4] https://www.infoq.com/presentations/uber-ml-architecture-models?utm_source=infoqemail&utm_medium=ai-ml-data-eng&utm_campaign=newsletter&utm_content=05152018