September 20, 2016 | Written by: trossman
It’s not just malicious users that can hurt your application or service. Sometimes well-intentioned users will put a stake through your application’s heart, taking your service down and triggering alerts to wake you up in the middle of the best dream you’ve had since you were a kid.
Our logging service has two different entry points through which our users interact with the service. The first entry point is used by agents to send logs to the service. We call this log ingestion. The other entry point is used to query the log data. As we learned the hard way, each of these entry points is also an attack surface that can be exploited by malicious or accidental users.
Strategies for identifying dangerous traffic
This post continues the story introduced in my prior blog post “Performance, Scale and Reliability”.
A couple months ago, we were ingesting about a terabyte per day in one of our busiest datacenters. Today, we’re ingesting about 3.5TB per day. This growth didn’t just sneak up on us – it hammered us in the middle of the night during long weekends! One of the first times our ingest traffic surged, we noticed that the bulk of the traffic was coming from a single tenant! Ah, the blacklist is born.
Blacklists are a really great tool to protect your service. Fundamentally, any traffic that matches the blacklist is rejected as early as possible within the system. In our logging service, we have two levels of blacklisting. First, on our load balancers, we can block traffic from source IP addresses. Second, we have a cluster of lumberjack servers that process all ingestion traffic, including SSL termination and authentication. Although traffic that reaches the lumberjack cluster has already travelled deeper into the system and consumed more compute, it's the first place where we know which tenant sent the traffic, which makes it the perfect place to blacklist individual tenants.
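In essence, the two levels boil down to two cheap set-membership checks. Here's a minimal sketch of the idea (the class and method names are purely illustrative, not our actual implementation):

```python
# Hypothetical sketch of two-level blacklisting: source IPs are checked
# at the edge (load balancer), tenants are checked deeper in the system
# once the connection has been authenticated.

class Blacklist:
    def __init__(self):
        self.blocked_ips = set()
        self.blocked_tenants = set()

    def allow_connection(self, source_ip):
        # Cheapest check first: reject known-bad IPs at the edge.
        return source_ip not in self.blocked_ips

    def allow_tenant(self, tenant_id):
        # Deeper check, only possible once the tenant is known.
        return tenant_id not in self.blocked_tenants
```

The key property is that each check runs at the earliest point in the pipeline where the information it needs (source IP vs. tenant identity) is available.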
One of the simplest ways to identify a dangerous tenant that could hurt our service is to put a cap on the amount of traffic the tenant can send us. By default, we give every tenant 1GB of logs per day. As soon as a tenant hits their cap, we stop accepting their logs: our lumberjack servers are instrumented to count the amount of data sent by each tenant, and they automatically blacklist a tenant for the rest of the day once the cap is exceeded.
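A per-tenant daily cap can be sketched roughly like this (hypothetical names and structure; not our production code):

```python
import datetime

DAILY_CAP_BYTES = 1 * 1024 ** 3  # default: 1GB of logs per tenant per day

class IngestCap:
    """Count bytes sent by each tenant and report when a tenant exceeds
    its daily cap. Illustrative sketch only."""

    def __init__(self, cap=DAILY_CAP_BYTES):
        self.cap = cap
        self.counts = {}
        self.day = datetime.date.today()

    def record(self, tenant_id, nbytes, today=None):
        """Return True if the tenant is still under its cap after this data."""
        today = today or datetime.date.today()
        if today != self.day:
            # A new day: counts (and therefore the blacklist) reset.
            self.day, self.counts = today, {}
        self.counts[tenant_id] = self.counts.get(tenant_id, 0) + nbytes
        return self.counts[tenant_id] <= self.cap
```

A `False` return is the signal to add the tenant to the day's blacklist.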
Since most of our client traffic comes from logstash forwarders, it's worth looking closely at their behaviour. It turns out that when a forwarder cannot successfully send logs to the service, it simply buffers the logs and tries to send them again later. Now imagine if our capped blacklist simply rejected all connections from a tenant! The agent would hold on to everything until the cap reset the next day, and then, all of a sudden, hammer us in the middle of an awesome dream.
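One way to avoid that next-day stampede, sketched here purely as an illustration (the `is_capped`, `send_ack`, and `store` hooks are assumptions, not our real API), is to keep acknowledging a capped tenant's chunks while quietly discarding the data, so the forwarder never builds up a buffer:

```python
def handle_chunk(tenant_id, chunk, is_capped, send_ack, store):
    """Accept one chunk of logs from a forwarder. The is_capped / send_ack /
    store hooks are hypothetical, for illustration only."""
    if not is_capped(tenant_id):
        store(tenant_id, chunk)  # under the cap: persist the logs normally
    # Ack even when we discard, so the forwarder doesn't buffer a whole
    # day of logs and replay them the moment the cap resets.
    send_ack()
```

The design choice here is that the blacklist drops data, not connections: from the agent's point of view nothing failed, so there is nothing to retry.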
There are a number of quite reasonable scenarios where a client might suddenly send a big surge of logs that would otherwise clog up our system, causing us to miss our service levels for the time it takes logs to become queryable. To address this, we introduced a classic throttling solution. Again, the lumberjack cluster was the perfect place to implement it. The lumberjack protocol requires an “ack” from the service before the client sends the next chunk of logs. By delaying the “ack”, we are able to throttle clients, which in many cases is enough to avoid breaking SLAs because of one or two bad apples.
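Delaying the ack amounts to pacing each client at a target byte rate. Here's a minimal sketch of that idea (an illustrative class, not the service's actual throttle):

```python
import time

class AckThrottle:
    """Pace a client by delaying the protocol 'ack' so the client averages
    at most rate_bytes_per_sec. A sketch, not a production implementation."""

    def __init__(self, rate_bytes_per_sec):
        self.rate = rate_bytes_per_sec
        self.next_ack = 0.0  # earliest time the next ack may be sent

    def delay_for(self, chunk_bytes, now=None):
        """Return how many seconds to wait before acking this chunk."""
        if now is None:
            now = time.monotonic()
        send_at = max(self.next_ack, now)
        # Once acked, the client may send another chunk, so budget enough
        # time for the bytes we just accepted.
        self.next_ack = send_at + chunk_bytes / self.rate
        return send_at - now
```

Because the client won't send the next chunk until it sees the ack, sleeping for `delay_for(...)` seconds before replying is all it takes to hold a chatty agent to the target rate.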
Protect your service!
I hope you’ve seen how blacklists, capping, and throttling can be valuable tools to protect your service from malicious users. You may recall that I said that both our entry points can be lethal, but I’ve only mentioned ingestion so far. As you can imagine, these same tools can be applied to that nasty little attack surface too. Next time, I’ll tell you about some more fun times in the middle of the night.
For more details about the logging service, see this information in the documentation: