March 24, 2023

The Firehose is Out, Syslog is In | Part 1

A new take on logging for VMware Tanzu Application Service and Cloud Foundry

This blog was written by Matthew Kocher and Nick Kuhn.

In this blog, we will recap the current logging infrastructure of Cloud Foundry and VMware Tanzu Application Service, then take a deep dive into what we are doing to improve the overall logging architecture and platform engineering experience.

Let’s review the current state.

In the early days of Cloud Foundry, logs were archived to a file on disk in the container.

That was a bad time: logs would disappear if an app restarted, and they would fill up the disk if the app didn’t restart.

Then the Loggregator subsystem was dreamed up to make developers' lives easier. Beyond removing the pain of the old system, Loggregator let developers view their app’s logs through the cf CLI. Developers could also send logs to third-party log management services for archiving on a per-app basis via user-created syslog drains.
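
For context, here is roughly what that looks like from a developer's point of view with the cf CLI; the app name and drain endpoint below are placeholders:

  cf logs my-app --recent
  cf create-user-provided-service my-drain -l syslog-tls://logs.example.com:6514
  cf bind-service my-app my-drain

The first command pulls recent log lines for an app, and the other two create a user-provided service pointing at a syslog endpoint and bind it to the app so its logs are forwarded there.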

But what about platform engineers? Initially, Cloud Foundry platforms were large, multi-tenant affairs whose engineers saw archiving app logs only as a liability to themselves. However, Cloud Foundry was quickly adopted by large companies and organizations that wanted the productivity boost of a user experience like that of a public-cloud managed service, combined with all the control and flexibility of running the service themselves.

With this new use case came the desire to archive all the logs from every app in the system, and the Firehose was born.

The Firehose API is provided by Dopplers and Traffic Controllers and accessed by custom code we call nozzles. Nozzles connect to the Traffic Controller and request that a shard of logs be sent to them. For each nozzle connection the Traffic Controller receives, it creates a corresponding array of connections to each Doppler, resulting in persistent connections that look like this:

[Diagram: each nozzle instance connects to a Traffic Controller, which in turn holds a connection to every Doppler on its behalf]

When a Doppler receives an envelope of logs or metrics, it loops through every nozzle it knows about and, for each one, picks one of the nozzle instance connections at random to send the envelope to. In this way, messages are sent to every nozzle and evenly sharded across the nozzle instances.
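
To make that behavior concrete, here is a minimal Go sketch of the dispatch loop described above. It illustrates the idea only, not Loggregator's actual code; the Envelope and Subscription types are invented for the example:

  // Sketch of the Doppler fan-out: send each envelope once per subscription,
  // to a randomly chosen instance connection within that subscription.
  package main

  import (
      "fmt"
      "math/rand"
  )

  // Envelope stands in for a Loggregator log or metric envelope.
  type Envelope struct{ Payload string }

  // Subscription groups the connections of one nozzle's instances.
  type Subscription struct {
      Name        string
      Connections []chan Envelope // one channel per nozzle instance
  }

  // dispatch delivers the envelope to every subscription, picking a random
  // instance connection within each, so each nozzle sees every message while
  // its instances share the load.
  func dispatch(env Envelope, subs []Subscription) {
      for _, sub := range subs {
          conn := sub.Connections[rand.Intn(len(sub.Connections))]
          conn <- env
      }
  }

  func main() {
      mk := func() chan Envelope { return make(chan Envelope, 1) } // buffered so the demo doesn't block
      subs := []Subscription{
          {Name: "archiver", Connections: []chan Envelope{mk(), mk()}},
          {Name: "metrics", Connections: []chan Envelope{mk(), mk()}},
      }
      dispatch(Envelope{Payload: "app log line"}, subs)
      fmt.Println("envelope delivered once per subscription")
  }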

The crux of the issue here is the fanout problem: for each connection to a Traffic Controller asking for logs, the Traffic Controller creates corresponding connections to each Doppler. So the number of connections in the system is M*N, where M is the number of nozzle instances and N is the number of Dopplers. Multiplicative growth in the number of connections when scaling horizontally is not a good long-term plan, as it creates a ceiling on the size of Loggregator clusters and, with that, a limit on the volume of logs a Cloud Foundry installation can handle.
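
To put illustrative numbers on it: 10 nozzle instances against 20 Dopplers already means 10 * 20 = 200 persistent connections, and doubling either side doubles the total.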

When looking at this dilemma, there are (at least!) two possible long-term solutions.

  1. Sharding the logging infrastructure and assigning apps to a specific cluster.
  2. Distributing log destinations for an application directly to the node generating the logs and avoiding having a doppler/traffic controller routing mesh entirely.

Both solutions address the M*N connection problem with trade-offs. 

If you’ve been following Cloud Foundry development for a while, you may already know that we didn’t really pick between these; we did both: Log Cache and Aggregate Syslog Drains.

For many use cases, like the CLI tailing app logs, we’ve switched to Log Cache, where app IDs are hashed and logs are sharded deterministically to a given node. This works great for answering API requests like “give me the last 200 lines from the prod-mobile-api app”.
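
As an illustration of the idea (not Log Cache's actual routing code), deterministic sharding by app ID can be as simple as hashing the ID and taking it modulo the number of nodes:

  // Sketch of deterministic sharding: the same app ID always maps to the
  // same node, so a query for that app's logs goes to exactly one place.
  package main

  import (
      "fmt"
      "hash/fnv"
  )

  // nodeFor returns the index of the node responsible for a given app ID.
  func nodeFor(appID string, numNodes int) int {
      h := fnv.New32a()
      h.Write([]byte(appID))
      return int(h.Sum32() % uint32(numNodes))
  }

  func main() {
      fmt.Println(nodeFor("prod-mobile-api", 4))
  }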

We are moving toward a future without the need to maintain the Firehose.

To replace the Firehose, we’ve built Aggregate Syslog Drains. Log Cache itself recently switched to using Aggregate Syslog Drains to receive log and metric messages by default.

Logs and metrics egressed by the syslog agent are sent in RFC-compliant syslog format and tagged with a substantial amount of metadata about the application using syslog RFC 5424 structured-data tags. These tags include such niceties as the app, space, and organization names and the process type. Egress through the syslog agent means that logging capacity scales inherently as the components generating logs are scaled, and the logging data flows on a direct path from where it is generated to each configured log storage system.
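
For a rough sense of the format, an RFC 5424 message carrying that metadata could look like the line below; the hostname, GUID, field values, and structured-data ID here are illustrative placeholders and can vary by platform version:

  <14>1 2023-03-24T16:20:00.000000+00:00 acme.prod.my-app app-guid [APP/PROC/WEB/0] - [tags@47450 organization_name="acme" space_name="prod" app_name="my-app" process_type="web"] Handled GET /health in 3ms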

Why does this new future matter to you?

We’re excited to say that we’re confident Aggregate Syslog Drains are the future for platform engineers who want to archive all the logs and metrics coming from a Cloud Foundry installation, and we encourage engineers to start moving from nozzles to aggregate drains. Furthermore, by migrating to Aggregate Syslog Drains, platform engineers can start to realize the benefit of reducing the overall virtual machine footprint they need to maintain within each Cloud Foundry or Tanzu Application Service foundation.

Stay tuned for part two, in which we will go into detail on how to configure Aggregate Syslog Drains within the platform.

 

 
