heroku-cc

How to run:

$ gzcat /path/to/fixture.log.gz | go run main.go > results.csv
$ diff -s results.csv their/results.csv

Questions

  1. Our fixture file is very small in comparison to the actual number of request log lines that we receive. Discuss how the choices you made for your solution, e.g. language, algorithms, data structures, would (or would not) be able to handle 100x, or 1000x more requests, and 50x, or 500x more hosts and still deliver summaries in near real time.
  2. It turns out that averages are not a very good metric for measuring performance because one long running request can skew the average. Instead, we probably want to use percentiles. How could we support storing the median, 95th, and 99th percentiles in your solution?
  3. While 1-minute data is great, it's hard to plot multiple days worth of data at 1-minute resolution. How can we obtain 10-minute, or 1-hour summaries instead?
  4. Our support team suggests that an average response time > 2 seconds is much too slow. How would you write a program that read the output file and printed "HOST IS TOO SLOW AT ." if the last 5 1-minute summaries for a host were greater than 2 seconds?
  5. In the real world, we don't have guarantees about clock monotonicity, and we see data arrive late. That is to say that there is no hard guarantee that, for L = request log line number and T = the timestamp, T(L-1) is chronologically less than T(L). Describe a way in which your program could handle request logs that arrive up to N minutes late. What is the impact on memory usage, and “downstream” services such as the “SLOW RESPONSE TIME” program discussed above?

Answers

  1. I wrote this in Go, which helps with execution speed, but this kind of data is just fundamentally intensive to parse and plot. I'm fairly sure that a 100x or 1000x increase in request count would only cause this exact solution to fall over if more requests arrive than can fit in RAM at any given time. The solution is essentially O(n), though, so any additional requests translate directly into longer processing time.
  2. Sort each bucket's elements by the amount of time the request took. The median is the middle element of that sorted data, and the 95th and 99th percentiles come from applying the standard percentile calculation (e.g. nearest-rank) to the same sorted durations; see the percentile sketch after this list.
  3. You can glom multiple buckets of requests together: group the per-minute results into their enclosing 10-minute or 1-hour window and combine their counts and totals, building up the coarser summaries that way; see the roll-up sketch after this list.
  4. Scrape the CSV file and store the summaries by host and then by time. When a query is made about a host, answer it from the last 5 minutes of stored data. The structure doesn't have to grow without bound: old entries are pruned out of the map as new entries are added; see the checker sketch after this list.
  5. Effectively you can hold on to requests for up to N minutes (say, 10) before they are processed, but that would loosen the other real-time constraints. The impact on memory usage is that the program holds up to roughly 10 times as much in-flight data as it would before this change, and downstream services see their results delayed by up to 10 minutes. However, if you redefine the problem so that downstream services take a revisionist view of history and are okay with results changing within that 10-minute processing window, you can instead have this data continuously recalculated as a function applied to the infinite stream of incoming data; see the late-arrival sketch after this list.
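
A minimal sketch of the percentile idea from answer 2, assuming a Bucket type that keeps every request duration seen for one host during one minute (the type and field names here are illustrative, not the actual program's):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// Bucket holds every request duration observed for one host in one minute.
type Bucket struct {
	Durations []time.Duration
}

// Percentile returns the p-th percentile (0 < p <= 100) of the bucket's
// durations using the nearest-rank method on a sorted copy of the data.
func (b *Bucket) Percentile(p float64) time.Duration {
	if len(b.Durations) == 0 {
		return 0
	}
	sorted := make([]time.Duration, len(b.Durations))
	copy(sorted, b.Durations)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Nearest-rank: take the ceil(p/100 * N)-th smallest value.
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	b := Bucket{Durations: []time.Duration{
		80 * time.Millisecond, 95 * time.Millisecond, 110 * time.Millisecond,
		120 * time.Millisecond, 3 * time.Second, // one slow request skews the average
	}}
	fmt.Println("median:", b.Percentile(50)) // unaffected by the outlier
	fmt.Println("p95:", b.Percentile(95))
	fmt.Println("p99:", b.Percentile(99))
}
```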
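
A minimal sketch of the roll-up from answer 3, assuming the 1-minute output can be read back as per-host summaries carrying a count and a total duration (the Summary type is an assumption for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// Summary is one per-host, per-minute line of output.
type Summary struct {
	Host  string
	At    time.Time     // start of the 1-minute bucket
	Count int           // number of requests in the bucket
	Total time.Duration // sum of request durations in the bucket
}

type windowKey struct {
	Host string
	At   time.Time
}

// Rollup merges 1-minute summaries into window-sized ones (e.g. 10m or 1h) by
// truncating each summary's timestamp to the window and adding counts/totals.
func Rollup(in []Summary, window time.Duration) map[windowKey]Summary {
	out := make(map[windowKey]Summary)
	for _, s := range in {
		k := windowKey{Host: s.Host, At: s.At.Truncate(window)}
		agg := out[k]
		agg.Host, agg.At = k.Host, k.At
		agg.Count += s.Count
		agg.Total += s.Total
		out[k] = agg
	}
	return out
}

func main() {
	minute := func(m int) time.Time { return time.Date(2015, 1, 1, 0, m, 0, 0, time.UTC) }
	in := []Summary{
		{"api.example.com", minute(0), 10, 2 * time.Second},
		{"api.example.com", minute(7), 5, 1 * time.Second},
		{"api.example.com", minute(12), 4, 900 * time.Millisecond},
	}
	for k, s := range Rollup(in, 10*time.Minute) {
		fmt.Printf("%s %s count=%d avg=%s\n",
			k.Host, k.At.Format(time.RFC3339), s.Count, s.Total/time.Duration(s.Count))
	}
}
```

Counts and totals (and therefore averages) merge cleanly this way; percentiles do not, since they need the underlying durations (or an approximating sketch) rather than pre-computed per-minute p50/p95/p99 values.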
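
A minimal sketch of the checker from answer 4. The CSV column layout (host, minute, average in seconds) is an assumption, and since the placeholder in the question's message is elided, printing the host name here is also an assumption:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strconv"
)

const (
	window    = 5   // trailing 1-minute summaries to keep per host
	threshold = 2.0 // seconds
)

func main() {
	r := csv.NewReader(os.Stdin)
	recent := make(map[string][]float64) // host -> last `window` averages

	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil || len(rec) < 3 {
			continue // skip headers/malformed lines in this sketch
		}
		host := rec[0]
		avg, err := strconv.ParseFloat(rec[2], 64)
		if err != nil {
			continue
		}

		// Append and prune so at most `window` entries are kept per host;
		// this is the "delete old entries as new ones arrive" pruning above.
		recent[host] = append(recent[host], avg)
		if len(recent[host]) > window {
			recent[host] = recent[host][1:]
		}
		if len(recent[host]) == window && allAbove(recent[host], threshold) {
			fmt.Printf("HOST IS TOO SLOW AT %s.\n", host)
		}
	}
}

func allAbove(xs []float64, limit float64) bool {
	for _, x := range xs {
		if x <= limit {
			return false
		}
	}
	return true
}
```

Run it the same way as the main program, e.g. `go run slowhosts.go < results.csv` (file name illustrative).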
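
A minimal sketch of the late-arrival handling from answer 5: buckets stay open until the newest timestamp seen (a simple watermark) is more than N minutes past the bucket's minute, and only then are they flushed downstream. The aggregator type and its method names are assumptions for illustration:

```go
package main

import (
	"fmt"
	"time"
)

const lateness = 10 * time.Minute // N: how late a log line is allowed to arrive

type bucketKey struct {
	Host   string
	Minute time.Time
}

type aggregator struct {
	totals    map[bucketKey]time.Duration // running total per still-open bucket
	counts    map[bucketKey]int
	watermark time.Time // newest event timestamp seen so far
}

func newAggregator() *aggregator {
	return &aggregator{totals: map[bucketKey]time.Duration{}, counts: map[bucketKey]int{}}
}

// Add records one request; lines may arrive out of chronological order.
func (a *aggregator) Add(host string, at time.Time, d time.Duration) {
	k := bucketKey{host, at.Truncate(time.Minute)}
	a.totals[k] += d
	a.counts[k]++
	if at.After(a.watermark) {
		a.watermark = at
	}
	a.flushClosed()
}

// flushClosed emits every bucket whose minute is more than `lateness` behind
// the watermark; a line later than that falls outside the N-minute guarantee.
func (a *aggregator) flushClosed() {
	for k, total := range a.totals {
		if a.watermark.Sub(k.Minute) > lateness {
			avg := total / time.Duration(a.counts[k])
			fmt.Printf("%s %s avg=%s\n", k.Host, k.Minute.Format(time.RFC3339), avg)
			delete(a.totals, k)
			delete(a.counts, k)
		}
	}
}

func main() {
	agg := newAggregator()
	t0 := time.Date(2015, 1, 1, 0, 0, 0, 0, time.UTC)
	agg.Add("api.example.com", t0.Add(30*time.Second), 150*time.Millisecond)
	agg.Add("api.example.com", t0, 90*time.Millisecond) // arrives late, still lands in its bucket
	agg.Add("api.example.com", t0.Add(12*time.Minute), 200*time.Millisecond)
	// the 00:00 bucket is flushed only once the watermark moves past 00:10
}
```

In this sketch, memory grows with the number of still-open buckets (roughly hosts × N minutes) rather than with raw request count, and anything downstream, such as the slow-host checker, sees its results up to N minutes late unless it is willing to accept revised summaries.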