A recent project involves a script to do a regular data slurp, process it, and write the results to Google BigQuery. The script runs once an hour via cron.
The data slurp is such that my script always requests a specific range of data. Thus if processing fails at any point, I can easily re-slurp the same data and process it again.
On the BigQuery side, however, I need to ensure data integrity. I need to ensure that no data are lost, nor are any data inserted twice. This must be accomplished without ever querying the BigQuery tables, as queries are expensive.
The strategy then is to fail hard during the data slurp and processing phases, so that if something goes wrong, nothing goes into BigQuery, and we try again in an hour. This works well for recovering from the occasional communication errors encountered during the data slurp.
On the other hand, an error during the BigQuery insert phase must not cause a hard failure, as that would leave us in an indeterminate state, with only some of our data written. Instead, BigQuery inserts that fail should be retried, and retried again, until they succeed. (Of course I need to make sure that the failures we’re retrying are transient, but that’s a separate topic.)
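To make that concrete, here is a minimal sketch of such a retry loop in Python, assuming the google-cloud-bigquery client and treating 5xx responses as the transient case. The function name, batching, backoff policy, and give-up deadline are illustrative assumptions, not the script’s actual code.

```python
import time

from google.api_core import exceptions
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials


def insert_batch_with_retry(table_id, rows, row_ids, max_wait=3600):
    """Retry a streaming insert on transient errors until it succeeds.

    The one-hour deadline is an illustrative safeguard; the real policy
    could just as well retry indefinitely.
    """
    delay = 1
    deadline = time.monotonic() + max_wait
    while True:
        try:
            errors = client.insert_rows_json(table_id, rows, row_ids=row_ids)
            if not errors:
                return  # batch accepted
            # Per-row errors come back in the response body rather than as
            # an exception; for simplicity this sketch retries the whole batch.
        except exceptions.ServerError:
            pass  # 5xx from BigQuery: transient, safe to retry
        if time.monotonic() >= deadline:
            raise RuntimeError("gave up retrying BigQuery insert")
        time.sleep(delay)
        delay = min(delay * 2, 60)  # exponential backoff, capped at a minute
```

Any non-transient error (a missing table, bad credentials, and so on) still propagates and fails hard, which keeps it consistent with the strategy above.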
The Incident
Today in the log I found an “unknown error” entry, which means that something raised an exception in an unexpected place.
Inspecting the log file, I saw that one of the BigQuery insert calls had encountered a 500 (internal server error) response. This was supposed to trigger an automatic retry, but the retry failed on account of one line of errant logging code. The script failed hard and marked the job as not done, even though several thousand rows had already made it into BigQuery.
On the next run an hour later, the script dutifully played catch-up, reprocessing the data that had gone astray and inserting it, this time successfully, into BigQuery.
So no data have been lost, but I’ve failed at preventing duplication.
Fortunately, we have a timestamp on every insert, so it should be a relatively simple matter to manually delete everything that was inserted at that particular hour.
So imagine my surprise and confusion when I discovered that there were exactly zero records timestamped in that range. The logger clearly showed several batches of 500 inserts successfully completed before the crash; where had all the records gone?
As it turns out, it’s the insert ID that saved us. Each data point is sent with a unique insert ID which is generated as a function of the data itself. When BigQuery received insert IDs that it had seen before, it silently deduped the data for us.
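For context, here is a sketch of how such a content-derived insert ID might be produced and passed along, again assuming the google-cloud-bigquery client. The hashing scheme and the table and field names are illustrative, not the script’s actual function.

```python
import hashlib
import json

from google.cloud import bigquery


def insert_id_for(row):
    """Derive a stable insert ID from the row's content."""
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


client = bigquery.Client()
rows = [{"user": "alice", "value": 42, "ts": "2014-08-01T12:00:00Z"}]
row_ids = [insert_id_for(r) for r in rows]

# BigQuery uses the per-row insert IDs to dedupe repeated streaming
# inserts of the same data on a best-effort basis.
errors = client.insert_rows_json("my_project.my_dataset.my_table", rows,
                                 row_ids=row_ids)
```

Because the ID is a pure function of the row, re-slurping and re-inserting the same data yields the same IDs, which is exactly what let BigQuery recognize the second batch as a duplicate.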
Two observations to note:
- The documentation states that BigQuery will remember the insert IDs for “at least one minute.” In our case, the duplicate data showed up an hour later and was still detected.
- The deduping resulted in the earlier inserts being discarded and the later inserts being kept.
I’ve fixed the errant logging code, by the way.