Tracking User Behavior At Scale with Streaming Reactive Big Data Systems

Deadshot
Jan 23, 2019
Behavioral analytics through big data applications can be used to gain insights, analyze product behavior, and provide a better customer experience.
Did you know that PayPal’s product performance tracking platform processes over 18.8 billion messages per day, making it one of the busiest systems at PayPal?
Data is key at PayPal. Accurate data allows for more informed and better decisions. PayPal customer data is a treasure trove and a major competitive differentiator. Our belief is that every employee should be empowered with timely insights to make data-driven decisions, and we are at a vantage point to make this a reality. To meet these increasingly ambitious time-to-insights demands, our behavioral analytics platform needs to be agile and frictionless through the data collection, data transformation, and insights generation phases.
To enable this, we revamped our big data platform, which takes up to 9 billion events per day, moving it from PayPal’s custom Spring-based framework to the Squbs framework (based on Akka Streams).
Companies are moving away from systems designed for data “at rest” in data warehouses and embracing the value of data “in motion”. Fast Data processing, high-velocity data processing in near real time, is the new trend. Akka Streams is a leading stream-processing framework for handling Fast Data.
Here are the key outcomes we observed after moving to a simple, fault-tolerant, and resilient architecture:

[Key outcomes figure]
Additional Benefits:
  • Availability: improved by the back-pressure-driven architecture.
  • Reduced processing time and a lower failure rate.
  • Enablement: deployment time reduced from 6 hours to 1.5 hours.
  • Reduced operational expense on support and maintenance.
Now the obvious questions that come to mind are:
How did we do that? How was PayPal able to achieve massive concurrency with Reactive Streams? How did we start thinking in streams?
The reactive data processing pipeline collects user behavior data for PayPal’s web and mobile products and provides us with enriched data to derive actionable insights. In this collector, each HTTP socket connection is its own stream, as defined by Akka HTTP. We still want to siphon all the data into the main enrichment stream. Since connection streams come and go as connections are created and closed, the merge point has to allow the set of streams feeding into the enrichment flow to change. This can be achieved by using a dynamic fan-in component called MergeHub. The enrichment stream simply enriches the beacons and sends them to Kafka using the sink provided by the Reactive Kafka API. A minimal sketch of this dynamic fan-in is shown below.
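This is not the production code, just a self-contained sketch of the MergeHub pattern: running the hub graph once materializes a Sink that any number of per-connection streams can attach to at runtime (the enrichment step and Kafka sink are stand-ins):

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{MergeHub, Sink, Source}

implicit val system: ActorSystem = ActorSystem("collector")
implicit val mat: ActorMaterializer = ActorMaterializer()

// Running the MergeHub graph once yields a Sink; any number of
// per-connection streams can attach to it (and detach from it)
// while the enrichment stream keeps running.
val enrichSink: Sink[String, NotUsed] =
  MergeHub.source[String](perProducerBufferSize = 16)
    .map(beacon => s"enriched($beacon)") // enrichment stand-in
    .to(Sink.foreach(println))           // Kafka sink stand-in
    .run()

// Each new HTTP connection stream simply runs into the shared sink.
Source.single("beacon-from-connection-1").runWith(enrichSink)
Source.single("beacon-from-connection-2").runWith(enrichSink)
```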


Big Data Collector and Enricher Flows
But, there is a problem in this picture. Can you guess what that problem is?
The problem is back-pressure. Kafka does re-balancing and re-partitioning from time to time, and we now have a completely back-pressured stream. So what do we do when Kafka is unavailable? We can’t possibly back-pressure the internet traffic, right?
For this reason, we insert another component that breaks the back-pressure from the Kafka sink: a buffer. But since an in-memory buffer can cause you to go out of memory, we built a persistent buffer, one that writes its elements to disk via memory-mapped files. We are using a technology called Chronicle Queue behind this buffer. A sketch of this stage follows.
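Again a sketch rather than the production code, assuming Squbs’ PersistentBuffer stage (which is backed by Chronicle Queue) and a hypothetical queue directory:

```scala
import java.io.File

import akka.NotUsed
import akka.stream.scaladsl.Flow
import akka.util.ByteString
import org.squbs.pattern.stream.{PersistentBuffer, QueueSerializer}

// Elements are persisted to memory-mapped files in the queue
// directory, so a stalled Kafka sink stops consuming without
// back-pressuring the HTTP side of the stream.
implicit val serializer: QueueSerializer[ByteString] = QueueSerializer[ByteString]()

val buffer = new PersistentBuffer[ByteString](new File("/var/squbs/beacon-buffer"))

// The async boundary detaches the buffer from the downstream sink:
// upstream keeps writing to disk while Kafka is unavailable, and
// the queue drains once Kafka recovers.
val detachedFlow: Flow[ByteString, ByteString, NotUsed] =
  Flow[ByteString].via(buffer.async)
```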


Optimized Big Data Collector and Enricher Flows
In the normal case, elements just flow through the persistent buffer and we do not really worry about it at all; nothing is kept in the buffer. Once Kafka becomes unavailable, the persistent buffer keeps the elements on local disk and only back-pressures when it reaches a high-watermark. Given adequate storage, we should be able to survive any re-balancing or re-partitioning of Kafka. Once Kafka reports being available again, we push all those elements to Kafka and the stream continues without data loss.


PerpetualStream — Enrichment Flow code
You can see the components lining up to reflect the diagram we drew up previously. The only part we did not mention is the map, which simply converts the enriched beacons to a Kafka ProducerRecord. A sketch of such a stream follows below.
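The original code image is not preserved in this post; what follows is a sketch of what the enrichment PerpetualStream could look like, assuming Squbs’ PerpetualStream and the Reactive Kafka producer sink. The Beacon model, the topic name, and the broker address are illustrative, and the persistent buffer stage from above would sit just before the Kafka sink:

```scala
import akka.NotUsed
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.{MergeHub, RunnableGraph, Sink}
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.ByteArraySerializer
import org.squbs.stream.PerpetualStream

case class Beacon(payload: Array[Byte]) // illustrative beacon model

class EnrichmentStream extends PerpetualStream[Sink[Beacon, NotUsed]] {

  private val producerSettings =
    ProducerSettings(context.system, new ByteArraySerializer, new ByteArraySerializer)
      .withBootstrapServers("kafka:9092")

  private def enrich(b: Beacon): Beacon = b // enrichment logic elided

  // Materializes to the MergeHub's Sink; the HTTP connection flows
  // look this value up and attach to it at connection time.
  override def streamGraph: RunnableGraph[Sink[Beacon, NotUsed]] =
    MergeHub.source[Beacon]
      .map(enrich)
      .map(b => new ProducerRecord[Array[Byte], Array[Byte]]("beacons", b.payload))
      .to(Producer.plainSink(producerSettings))
}
```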


FlowDefinition for Http Flow
Now the HTTP flow, the HTTP part of the stream. This one gets materialized for every single connection. It looks up the enrichment stream through its materialized value: that MergeHub.source actually materialized to a sink we can look up. The flow then deserializes the beacon, pushes it to the enrichment flow, and at the same time creates an HTTP response for the beacon generator. The key piece of this flow is the alsoTo(enrichStream), which pushes the beacons to the enrichment stream. This is, in essence, an implicit fan-out.
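Again the code image is not preserved; here is a sketch following the pattern from the Squbs documentation, where PerpetualStreamMatValue looks up the MergeHub sink by the perpetual stream’s well-known name. The actor path and the query-parameter deserialization are illustrative (the real flow deserializes the beacon from the request body):

```scala
import akka.NotUsed
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.stream.scaladsl.Flow
import org.squbs.stream.PerpetualStreamMatValue
import org.squbs.unicomplex.FlowDefinition

case class Beacon(payload: String) // illustrative beacon model

// Registered as the squbs HTTP flow; materialized once per connection.
class BeaconHttpFlow extends FlowDefinition with PerpetualStreamMatValue[Beacon] {

  override def flow: Flow[HttpRequest, HttpResponse, NotUsed] =
    Flow[HttpRequest]
      // Query parameter used for brevity; the real flow would
      // deserialize the beacon from the request entity.
      .map(req => Beacon(req.uri.query().getOrElse("payload", "")))
      // Implicit fan-out: alsoTo pushes each beacon into the
      // enrichment stream's MergeHub sink, while the element
      // continues on to generate the HTTP response.
      .alsoTo(matValue("/user/mycube/enrichmentStream"))
      .map(_ => HttpResponse(entity = "beacon accepted"))
}
```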
When thinking about reliability, starting up and shutting down the app is usually overlooked. When the app is shutting down, we should not pull the rug out from under the streams and let them crash. Instead, the stream should be gracefully shut down and all in-flight messages drained. Use PerpetualStream and Akka’s KillSwitch to gracefully shut down a stream and drain its messages.
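PerpetualStream wires this up for you, but the underlying mechanism is a plain Akka Streams kill switch. A minimal standalone sketch:

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, KillSwitches}
import akka.stream.scaladsl.{Keep, Sink, Source}

implicit val system: ActorSystem = ActorSystem("shutdown-demo")
implicit val mat: ActorMaterializer = ActorMaterializer()
import system.dispatcher

// Stand-in for the beacon stream.
val (switch, done) =
  Source.repeat("beacon")
    .viaMat(KillSwitches.single)(Keep.right)
    .toMat(Sink.ignore)(Keep.both)
    .run()

// On shutdown, complete the stream instead of crashing it:
// upstream stops, in-flight elements drain, and `done` completes.
switch.shutdown()
done.onComplete(_ => system.terminate())
```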
The riskier parts of a stream (e.g., external communication) should be gated with a Retry, Circuit Breaker, or Timeout.
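Squbs ships Retry and CircuitBreaker stream stages for exactly this. As one illustration of the idea, here is a sketch that gates a hypothetical external enrichment call with Akka’s built-in CircuitBreaker, whose callTimeout also bounds each attempt:

```scala
import scala.concurrent.Future
import scala.concurrent.duration._

import akka.NotUsed
import akka.actor.ActorSystem
import akka.pattern.CircuitBreaker
import akka.stream.scaladsl.Flow

implicit val system: ActorSystem = ActorSystem("gated")
import system.dispatcher

// Hypothetical external call, e.g., a geo-enrichment lookup.
def lookupGeo(ip: String): Future[String] = Future.successful("US")

// The breaker opens after repeated failures so the stream fails
// fast instead of piling up in-flight requests; callTimeout acts
// as the per-call timeout.
val breaker = new CircuitBreaker(
  system.scheduler,
  maxFailures = 5,
  callTimeout = 2.seconds,
  resetTimeout = 30.seconds)

val gatedEnrichment: Flow[String, String, NotUsed] =
  Flow[String].mapAsync(parallelism = 8) { ip =>
    breaker.withCircuitBreaker(lookupGeo(ip))
  }
```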
We faced a lot of technical and organizational challenges in converting a well-established team to this reactive mindset, but we believe the effort was worth it.
