28 Feb 2017, 15:51

BigQuery

A recent project involves a script to do a regular data slurp, process it, and write the results to Google BigQuery. The script runs once an hour via cron.

The data slurp is such that my script always requests a specific range of data. Thus if processing fails at any point, I can easily re-slurp the same data and process it again.

On the BigQuery side, however, I need to ensure data integrity. I need to ensure that no data are lost, nor are any data inserted twice. This must be accomplished without ever querying the BigQuery tables, as queries are expensive.

The strategy then is to fail hard during the data slurp and processing phases, so that if something goes wrong, nothing goes into BigQuery, and we try again in an hour. This works well for recovering from the occasional communication errors encountered during the data slurp.

On the other hand, an error during the BigQuery insert phase must not fail hard, as that would leave us in an indeterminate state of having some of our data written. Instead, BigQuery inserts that fail should be retried and retried again until they succeed. (Of course I need to make sure that the failures we’re retrying are transient, but that’s a separate topic.)
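
The retry logic isn’t shown in the post, but a minimal sketch looks something like this (a generic illustration, not the actual script; insertBatch is a hypothetical closure that performs one streaming-insert call):

package slurp

import (
	"log"
	"time"
)

// retryInsert keeps retrying one batch of inserts until it succeeds, doubling
// the wait between attempts (capped at one minute). In the real script one
// would first check that the error is in fact transient before retrying.
func retryInsert(insertBatch func() error) {
	backoff := time.Second
	for {
		err := insertBatch()
		if err == nil {
			return
		}
		log.Printf("insert failed (%v), retrying in %v", err, backoff)
		time.Sleep(backoff)
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}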

The Incident

Today in the log I found an “unknown error” entry, which means that something raised an exception in an unexpected place.

Inspecting the log file, I saw that one of the BigQuery insert calls had encountered a 500 (internal server error) response. This was supposed to trigger an automatic retry, but the retry failed on account of one line of errant logging code. The script failed hard and marked the job as not done, even though several thousand rows had already made it into BigQuery.

On the next run an hour later, the script dutifully played catch-up, reprocessing the data that had gone astray and inserting it, this time successfully, into BigQuery.

So no data have been lost, but I’ve failed at preventing duplication.

Fortunately, we have a timestamp on every insert, so it should be a relatively simple matter to manually delete everything that was inserted at that particular hour.

So imagine my surprise and confusion when I discovered that there were exactly zero records timestamped in that range. The log clearly showed several batches of 500 rows successfully inserted before the crash; where had all the records gone?

As it turns out, it’s the insert ID that saved us. Each data point is sent with a unique insert ID which is generated as a function of the data itself. When BigQuery received insert IDs that it had seen before, it silently deduped the data for us.
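
The script itself isn’t shown here, but the idea is simple to sketch (an illustration only, not the actual code; it assumes each row can be represented as a map before being sent):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// insertID derives a deterministic ID from the row content itself, so a
// retried insert of the same row carries the same ID and BigQuery can
// recognize it as a duplicate.
func insertID(row map[string]interface{}) (string, error) {
	// encoding/json sorts map keys, so the serialization is stable.
	b, err := json.Marshal(row)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	row := map[string]interface{}{"ts": "2017-02-28T14:00:00Z", "value": 42}
	id, _ := insertID(row)
	fmt.Println(id) // attach this as the row's insertId in the streaming insert request
}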

Two observations to note:

  • The documentation states that BigQuery will remember the insert IDs for “at least one minute.” In our case, the duplicate data showed up an hour later and was still detected.
  • The deduping resulted in the earlier inserts being discarded and the later inserts being kept.

I’ve fixed the errant logging code, by the way.

23 Dec 2016, 14:49

A lean, clean Golang machine

Writing a Go package that interacts with a relational data store such as Postgres is full of messiness.

Those of us who appreciate the strong-typedness of Go probably also appreciate the strong-typedness of SQL, and vice versa. Unfortunately, communication between Go and SQL is less than ideal. This is due partly to the mostly free-form text format of data exchange (queries) and partly to some subtle differences in data types.

Database nulls are a particular headache, leading to the contortions of defining types such as NullString, NullInt64, and NullBool, and an extra check is required every time you want to distinguish a null from a zero value.

Why not use an ORM? There has been a lot written on this already, but in a nutshell, the level of generality required means that pretty much everything is an interface{} with runtime checks to cast stuff into the types you need, and at this point we’ve lost the benefits of Go’s strong typing and may as well write our whole application in Ruby.

I’ve found that programmers who appreciate the power and control that comes from writing in a low-level compiled language such as Go also appreciate the power and control that comes from writing queries yourself in SQL.

So what’s the problem, really?

The real headache of Go + SQL is the volume of boilerplate code that goes with even relatively simple operations.

(1) Run a query that doesn’t return any results.

_, err := db.Exec(query, args...)
if err != nil {
	return err
}

(1a) Run a query that doesn’t return any results, but we want to know how many rows were changed.

res, err := db.Exec(query, args...)
if err != nil {
	return err
}
count, err := res.RowsAffected()
if err != nil {
	return err
}

(1b) Run a query that doesn’t return any results, and we’d like to catch and process integrity violations (e.g. duplicate entry on a unique field). This one requires some database-specific code; the example here is for Postgres.

_, err := db.Exec(query, args...)
duplicate := false
if err != nil {
	if pgerr, ok := err.(*pq.Error); ok {
		duplicate = pgerr.Code.Class().Name() == "integrity_constraint_violation"
	}
	if !duplicate {
		return err
	}
}

(1c) Run a query that doesn’t return any results, and we’d like to catch and process data exceptions (e.g. number out of range). This uses the same strategy as 1b and can be combined with it.
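
For reference, a sketch of that combination, in the same style as 1b and assuming the same lib/pq import:

_, err := db.Exec(query, args...)
duplicate, badValue := false, false
if err != nil {
	if pgerr, ok := err.(*pq.Error); ok {
		switch pgerr.Code.Class().Name() {
		case "integrity_constraint_violation":
			duplicate = true
		case "data_exception":
			badValue = true
		}
	}
	if !duplicate && !badValue {
		return err
	}
}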

(2) Run a query that returns one row.

err := db.QueryRow(query, args...).Scan(&arg1, &arg2, ... )
if err != nil {
	return err
}

(2a) Run a query that returns one row, and we’d like to catch and process the case where no rows are returned.

err := db.QueryRow(query, args...).Scan(&arg1, &arg2, ... )
noRows := err == sql.ErrNoRows
if err != nil && !noRows {
	return err
}

(3) Run a query that returns multiple rows.

rows, err := db.Query(query, args...)
if err != nil {
	return err
}
defer rows.Close()
for rows.Next() {
	err := rows.Scan(&arg1, &arg2, ... )
	if err != nil {
		return err
	}
}
err = rows.Err()
if err != nil {
	return err
}

None of these is particularly bad as far as boilerplate goes, but unless we’re writing an ORM (and we’ve already decided we’re not), we’re going to have tens, perhaps hundreds of these scattered throughout our application. Add to that another if err != nil every time we start a transaction, and I’m thinking there’s got to be a better way.

Organizing database access around high-level functionality

We would like to follow the unit of work pattern and create something akin to the session model of SQLAlchemy.

A simple example of a unit of work is a password reset, which checks for an email match, and then generates, saves, and returns a reset code. This will involve a minimum of two queries, which need to be in the same transaction. (Much more complicated units of work are possible, of course, both read-only and read-write.)

Our goal then is to find a way to have just one copy of all the boilerplate above and be able to substitute queries and argument lists as needed.

I’m going to propose that it’s straightforward to implement such a thing in Go by defining a custom transaction handler which extends the one in database/sql. This is done within the package that uses it.

type Tx struct {
	sql.Tx
}

We extend sql.Tx with methods to (a) convert all database errors to panics so that we can catch and process them all in one place, and (b) easily iterate over result sets.

To accomplish (a), we add the methods MustExec, MustQuery, and MustQueryRow. These are identical to Exec, Query, and QueryRow except that they panic instead of returning an error code. Also, in the case of MustQuery and MustQueryRow, they return custom Rows and Row objects that have similar extensions.

To accomplish (b), we add the method Each to the custom Rows object returned by MustQuery. Method Each iterates over the result set and calls a callback function for each row.

The ourError type is used to wrap errors that we want to convert back to error codes. It distinguishes them from other kinds of panics (e.g. out of memory).

type ourError struct {
	err error
}

func (tx Tx) MustExec(query string, args ...interface{}) sql.Result {
	res, err := tx.Exec(query, args...)
	if err != nil {
		panic(ourError{err})
	}
	return res
}

func (tx Tx) MustQuery(query string, args ...interface{}) *Rows {
	rows, err := tx.Query(query, args...)
	if err != nil {
		panic(ourError{err})
	}
	return &Rows{*rows}
}

func (tx Tx) MustQueryRow(query string, args ...interface{}) *Row {
	row := tx.QueryRow(query, args...)
	return &Row{*row}
}

The custom Row and Rows types are defined analogously. Row is extended with a MustScan method:

type Row struct {
	sql.Row
}

func (row Row) MustScan(args ...interface{}) {
	err := row.Scan(args...)
	if err != nil {
		panic(ourError{err})
	}
}

Rows is extended with a MustScan method and also with the Each iterator described above.

type Rows struct {
	sql.Rows
}

func (rows Rows) MustScan(args ...interface{}) {
	err := rows.Scan(args...)
	if err != nil {
		panic(ourError{err})
	}
}

func (rows *Rows) Each(f func(*Rows)) {
	defer rows.Close()
	for rows.Next() {
		f(rows)
	}
	err := rows.Err()
	if err != nil {
		panic(ourError{err})
	}
}

Now to make it all work, we define a custom transaction function. It sets up the transaction, provides the custom transaction handler to our callback, and then catches the panics.

func Xaction(db *sql.DB, f func(*Tx)) (err error) {

	var tx *sql.Tx
	tx, err = db.Begin()
	if err != nil {
		return
	}

	defer func() {
		if r := recover(); r != nil {
			if ourerr, ok := r.(ourError); ok {
				// This panic came from tx.Fail() or the equivalent. Unwrap it,
				// process it, and return it as an error code.
				tx.Rollback()
				err = ourerr.err
				if err == sql.ErrNoRows {
					err = ErrDoesNotExist
				} else if pgerr, ok := err.(*pq.Error); ok {
					switch pgerr.Code.Class().Name() {
					case "data_exception":
						err = ErrInvalidValue
					case "integrity_constraint_violation":
						// This could be lots of things: foreign key violation,
						// non-null constraint violation, etc., but we're generally
						// checking those in advance. As long as our code is in
						// order, unique constraints will be the only things we're
						// actually relying on the database to check for us.
						err = ErrDuplicate
					}
				}
			} else {
				// not our panic, so propagate it
				panic(r)
			}
		}
	}()

	f(&Tx{*tx}) // this runs the queries

	tx.Commit()
	return
}

This covers all of our boilerplate needs except for (1a) above. To accommodate (1a), we could extend sql.Result the same way we extended the others, but I haven’t really needed it yet, so I’ll leave it as an exercise for the reader.

One final method that’s there just to make everything neat and tidy is a Fail method on the transaction which can be used to return an arbitrary error.

func (tx Tx) Fail(err error) {
	panic(ourError{err})
}

The result

Our application code is now a lot neater.

err := Xaction(db, func(tx *Tx) {

	// Run a query that doesn't return any results.
	tx.MustExec(query1, args...)

	// Run a query that returns one row.
	tx.MustQueryRow(query2, args...).MustScan(&arg1, &arg2, ... )

	// Run a query that returns multiple rows.
	tx.MustQuery(query3, args...).Each(func(r *Rows) {
		r.MustScan(&arg1, &arg2, ... )
	})
})

if err != nil {
	switch err {

	case ErrDoesNotExist:
		// query2 returned no rows

	case ErrInvalidValue:
		// data exception

	case ErrDuplicate:
		// integrity violation

	default:
		return err
	}
}
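
To tie this back to the password-reset unit of work mentioned earlier, a rough sketch of how it might look with these extensions (the table and column names are hypothetical):

func CreateResetCode(db *sql.DB, email, code string) error {
	return Xaction(db, func(tx *Tx) {
		var userID int64
		// Look up the account; a missing row surfaces as ErrDoesNotExist.
		tx.MustQueryRow(`SELECT id FROM users WHERE email = $1`, email).MustScan(&userID)
		// Save the reset code; a unique-constraint violation would surface as ErrDuplicate.
		tx.MustExec(`INSERT INTO reset_codes (user_id, code) VALUES ($1, $2)`, userID, code)
	})
}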

And since this is an extension to the stock transaction handler rather than a replacement for it, we can still use the original non-must methods for any edge case that might require a different kind of error handling.

21 Jun 2016, 17:20

CRUD APIs are crud

I’m making the case specifically about REST APIs, but in fact everything here applies to any API, REST or not.

It’s a common paradigm to create a data model as a collection of tables in a relational database and then access the data from some client app (mobile or web). CRUD has become a popular way to access the data, perhaps because it’s easy to make and easy to explain.

In CRUD, we’re essentially giving the caller direct access to INSERT, SELECT, UPDATE and DELETE commands on our SQL database. Or something analogous if you’re into NoSQL. It comes with some permissions checking, of course, but as far as the capabilities of the API, that’s pretty much it.

The worst thing this does is expose the schema to the client, making it difficult to change the internal structure later on. Want to fix how tags are stored? Too bad, you’re going to break the API.

Besides that, there’s a lot of database capability that’s missing.

What happens when we have some business logic, e.g. in a stored procedure? We’ll have to create a separate endpoint for that.

What happens when we have some limited resource that we need to allocate on a first-come, first-served basis, e.g. room reservations? Again, we need some special processing to ensure that only one of two simultaneous requests succeeds.

What happens when we need some concept of transactions, so that when a series of operations can’t be completed we revert to the original state? Once again, we need to handle this separately.

What happens when we need to enforce some consistency between tables? In the case of foreign key constraints, it’s usually enough just to do the updates in the proper order, but other more complicated constraints will either need their own separate endpoints or will need to be momentarily violated. And being violated is never acceptable, even for just a moment.

The biggest problem with a CRUD API is that it’s shifting all the business logic to the caller, whereas it should instead be invisible to the caller. Even Microsoft recognized CRUD as an anti-pattern, and that was way back in 2005. Even when we’re only doing read and display, it’s often necessary to make several API calls to produce one document, unnecessarily slowing down load times.

The second-biggest problem with a CRUD API is specific to the update operation. Update does not represent any realistic use case. When do you ever want to rewrite an entire database row? We carry this mistake all the way to the UI, where we press edit on our profile, get back all of our data in input fields, change one field, and then write everything back.

APIs that work

I’m proposing a way to approach APIs, a way that avoids the pitfalls of CRUD. If you’re practicing domain-driven design (DDD), this will happen naturally. (Side note: at our company, we’ve been using DDD since day one, but no one here knew there was a buzzword for it.) None of what I’m proposing is new or groundbreaking; it’s just the way we should be doing things.

For read operations, there is one API call per display operation. Everything needed to render the requested view comes back as one bundle. Dynamic web content that’s generated server-side already works this way, and the API can work the same way. As a bonus, we can use the same API internally for server-generated pages and externally for client-generated views.

For write operations, there is a one-to-one correspondence between a user action and an API call. On the backend, one API call is one transaction, and if any part fails, then the whole thing fails. (Side note: one should never, ever build a system where it’s possible for only part of a user action to succeed. Usability nightmare.)

If we absolutely need some CRUD-style functionality (e.g. updating one’s profile), we should make our updates one field at a time. Not only does this match more closely what the average user will be doing, but it gives us an easy way to manage concurrency: simply require an update call to specify both the old and new value. If the old value doesn’t match, it’s an error.
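
As a sketch of that concurrency check (hypothetical table and column names, Postgres-style placeholders), the old value goes into the WHERE clause, and zero affected rows means someone else changed the field first:

package profile

import (
	"database/sql"
	"errors"
)

// ErrConflict is a hypothetical sentinel for "the old value didn't match".
var ErrConflict = errors.New("stored value does not match the supplied old value")

// UpdateEmail changes a single field, but only if the caller supplies the
// value it last saw.
func UpdateEmail(db *sql.DB, userID int64, oldEmail, newEmail string) error {
	res, err := db.Exec(
		`UPDATE users SET email = $1 WHERE id = $2 AND email = $3`,
		newEmail, userID, oldEmail)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		return ErrConflict
	}
	return nil
}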

Tracking changes and archiving

Tracking changes and archiving are two capabilities that are often added to a data store as an afterthought. I’d like to be proactive and incorporate them into the data design from the beginning.

The simplest way to track changes is with created-at and updated-at fields on every db model, and most database engines have neat ways to auto-update these fields. This level of tracking is of limited use, however, as we don’t know what changed or who changed it.

There are plenty of add-ons to do detailed revision tracking (django-reversion is one I like), but I’m a little bit concerned about the performance hit. Also, such add-ons make the created-at and updated-at fields redundant. That’s probably a good thing.

As for archiving, a common technique is to add a boolean field called archived to every model you want to be able to archive. On the plus side, it’s easy not to break references when you have non-archived data that refers to archived data, but we really shouldn’t have that happening. On the minus side, we end up adding and not archived to nearly every query.

We also might want to be able to permanently delete some archived material after a certain expiration time. We’d then need an archived_at field as well.

Here’s where CRUD fails again: Archive a record by setting archived to true and write it back. Unarchive it similarly. Determine the age of data by reading the created-at and updated-at fields on the model.

I propose that archiving and revision tracking can be implemented together in a way that’s clean and transparent to the client.

Instead of adding extra fields to the models, all the archive and tracking information goes into a read/append-only journal, which may or may not be implemented as a database table.

The journal contains one entry for each user action (see above). If there are system actions (e.g. daily aggregations) that get written to the database, those get included as well. Each entry contains a before-and-after detail of all changes. Since this before-and-after detail will only ever be accessed as a whole, it’s reasonable to make it one json bundle in a text field.
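
A sketch of what one journal entry might look like (the field names are illustrative, not a finished schema):

package journal

import (
	"encoding/json"
	"time"
)

// Entry is one append-only journal record for a single user (or system)
// action. Before and After carry the full before/after detail as one JSON
// blob each, since that detail is only ever read as a whole.
type Entry struct {
	ID        int64
	Actor     string          // who performed the action
	Action    string          // e.g. "reserve_room", "daily_aggregation"
	Before    json.RawMessage // state before the change
	After     json.RawMessage // state after the change
	CreatedAt time.Time       // doubles as the retention/expiry timestamp
}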

Archiving simply becomes a delete operation, as all the details are archived in the history. This means, of course, that related data needs to be archived together, which is a good thing. Furthermore, it’s trivial to put a time limit on data retention; simply delete old journal entries.

My next API is going to rock.

14 Apr 2016, 10:22

The Django REST framework

I may have to reconsider choosing Go for some server applications.

There’s a bit of a learning curve, but version 3 of the Django REST framework packs a lot of nice features. The web browsable API is the one that won me over.

16 Mar 2016, 16:47

Why I code in Go for server applications

I’ve written server applications in Ruby, Python, and Go. With Ruby I’ve tried out both Sinatra and Rails; with Python I’ve used Flask and Django; with Go I’ve used the net/http package.

There are endless arguments for and against using this framework or that language, and there are many valid reasons to like or dislike a set of tools. I personally like Django a lot. But Go has two features that beat the competition when it comes to writing web services: static typing and explicit error handling.

In Ruby, we often find ourselves having to check if a value is nil before processing it. Anything can be nil, and unexpected inputs often create nil values where we least expect. If we forget to check just one place in the code, sooner or later it shows up as a 500 error and our service is broken.

Test suites should cover this, but it’s just as easy to miss one edge case in a test suite as it is to miss one in the main code.

In Go, an ordinary value can’t be nil (pointers and a few other reference types can be, but it’s easy to know when one might not have been initialized). In the case of unexpected input, a variable is set to its zero value (e.g. 0, "", an empty struct), and the fact that there was unexpected input is conveyed separately.
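
For example, a failed string-to-int conversion with the standard strconv package yields the zero value plus an explicit error, rather than a nil that blows up later:

package main

import (
	"fmt"
	"strconv"
)

func main() {
	n, err := strconv.Atoi("not a number")
	if err != nil {
		// The bad input is reported here, explicitly; n is simply the zero value.
		fmt.Println("bad input:", err)
	}
	fmt.Println(n) // 0
}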

In Python, we often find ourselves having to convert types, especially when numerical input arrives in string variables. Using a string where an int is required will raise a TypeError exception, and casting a non-numeric string to int will raise a ValueError exception. Here too, it’s all too easy to miss one try-except block and get a 500 error.

Again, test suites should cover this, but that means a test for every possible branch in the code. Again, it’s just as easy to miss one edge case in a test suite as it is to miss one in the main code.

In Go, compatible types are checked at compile time, thereby eliminating this source of errors.

I choose Go for the simple reason that most 500-inducing code bugs can be either caught at compile time or avoided entirely. The result is faster and more stable deployments than the alternatives.

OK, I lied. I choose Go because I like it. But this is a great way to justify my choice.

03 Feb 2016, 17:51

A simple sentiment analysis of two US presidential candidates

Goal: To do some basic sentiment analysis on video content.

Test cases: Two 5-minute clips of US presidential candidate speeches.

Strategy:

  • Extract 5 minutes of audio from the beginning of each video.
  • Generate a transcript using a speech-to-text program.
  • Feed the transcript into a sentiment analyzer.

The original content

Extracting a 5-minute audio clip

There are many ways to do this. One way is to download the video using a browser add-on. Browser add-ons are easy to find but are also fickle, as they make it easy to download material in violation of copyright. And if you’re downloading from YouTube, you’re violating their terms of service, even if you’re not infringing on copyright. (We maintain that this exercise falls under fair use.)

Another way is to turn on audio capture while playing the video.

After the capture is complete, we’ll want to convert to FLAC if we’re not there already. We’ll use ffmpeg for this, e.g.:

ffmpeg -i captured-content.mp4 captured-content.flac

And then to extract the first 5 minutes:

flac --until=5:00 captured-content.flac -o five-minute-clip.flac

Both ffmpeg and flac are available via homebrew.

Generating a transcript

The IBM Watson Developer Cloud has a speech-to-text service which is available through an API and also has a demo page. In theory, one can get limited free access to the API after going through a mildly annoying sign-up process, but in practice I was unable to convince the API to accept the credentials I’d obtained.

Fortunately, the demo page allows file uploads and produced the following transcripts from the content above:

Trump

This is no way I’m leaving South Carolina. And I was gonna leave for tonight come back as it upsets up saying you have a five days we got a win on Saturday we’re going to win. Make America great again we’re gonna make America. We’re going to win. You know. It’s been an amazing friend Ruben all over the state today I love you too girly looking out dnmt love you I love you all. I love you. So many things are happening for a country. And it’s this is a movement time magazine last week at the most beautiful story cover. And they talk about its improvement they’ve never seen they say there’s not been anything like this I don’t know ever but they actually say ever. We went to Tampa the other day Tampa Florida would like two days notice fifteen thousand people that are turned away five thousand and by the way for all of the people in this room I can’t believe it this is a huge room but downstairs to filling up another one and there sadly sending people away we don’t like that right. No okay why don’t we all get up go let’s have that now. Now we have one of the great veterans outside I’ll stand up while one of the great great you are great. Love this guy. He loves the veterans and I love the veterans are we going to take care of our veterans I’ll tell you that we’re going to take it did not. They are not properly taken care of so we’re going to take a right we have sent a look at this. I knew you guys would say that I can spot a veteran a long ways off. But we are we going to take a break here we’re gonna take you have a military we’re going to take you have a military because our military is being whittled away whittled away we’re going to make our military so big so strong so powerful nobody’s going to mess with us anymore nobody nobody. Nobody. So Nikki Haley a very nice woman she better speech the other day you saw that and she was talking about anger and she said there’s a lot of anger and I guess she was applying all of us you know really referring to us. And by the end of the day she was actually saying that Donald Trump is a friend of mine he’s been a supporter of mine everything else you know the tone at the beginning was designed by the time that she was just barraged with people she said I think we better change our path here. And by the end of the day and it was fun it was great but she said you know there is anger but I said there is a group and I was asked during not this debate but the previous debate I was asked. I by the way did you love this last debate dnmt. Listen to like. They came at me from every angle possible. Don’t know they came out before every angle you know sort of interesting. They were hitting me with things like and such until as you know I never realized I’ve always don’t politicians it is honest but I’ve never known the level of dishonesty. And I deal in industries a lot of different but mostly real estate and like in Manhattan at different places but I’ve never seen people as dishonest as politicians. They will say anything. Like okay so a lot of you people understand that you you get when you’ve seen the speeches that you see in a lot of it and you know that I protect the second amendment more than anybody by far dnmt more than. And this guy Ted Cruz gets upset Donald Trump does not respect a second amendment and the more that anybody I’m with the second amendment. I saw no no it’s lies. And then they do commercials and you know he did it to Ben Carson and him in particular in all fairness. 
Jeff is represents but these are minor misrepresentations and he’s not going anywhere anyway so what would how casual. Not as. Well Jeb was talking about eminent domain Donald Trump used eminent domain privately then I see there’s a big story I had to bring this out luck. Proof Jeb bush under eminent domain took a disabled veterans property. Something about me. No state. Honestly these guys are these guys are the worst. Eminent domain without eminent domain by the way you don’t up highways roads airports hospitals you know not bridges you drive anything so. They say Donald Trump does look like Eminem and I don’t even tell me but you need to road you need a highway you need you know it’s funny they all want they all want the keystone pipeline right but without eminent domain without think of it without eminent domain you can’t have the keystone pipeline and we’re going to get the keystone pipeline approved but but fluids jumps. It’s jobs but remember this when it gets approved a politicians go to baby approve it.

Sanders

President Falwell and. David. Ok thank you very much for inviting my wife Jane and. Ought to be with you this morning we appreciate the invitation. Very much. And let me start off by acknowledging what I think. All of you already know. And that is the views. That many here at liberty university have. And all I. On a number of important issues. A very very different. I believe in women’s rights dnmt. In the light of the woman to control her own body dnmt. I believe in gay rights. And now. Those of my views. And it is no secret. But I came here today. Because I believe from the bottom of my heart. That it is vitally important for those of us. Who hold different views. To be able to engage in any civil discourse. Who often in our country and I think both sides. Bear responsibility for us. There is too much shouting at each other. There is too much making fun of each other. Now in my view then are you can say this is somebody who whose voice is hoarse because I have given dozens of speeches. And the last few months it is easy. To go out and talk to people who agree with you are missing Greensboro North Carolina just last night. Alright. We are nine thousand people out. Mostly they agreed with me tonight. We’re going to be a Manassas and thousands out they agree with me. It’s not a whole lot to do. That’s what politicians by and large do we go out and we talk to people who agree with us. But it is harder. But not less important. For us to try and communicate with those who do not agree with us on every issue. After. And it is important to see where if possible and I do believe it’s possible we can find common grounds. No liberty university. Is a religious school obviously. Pn. All of you are proud of the. You already school. Which as all of us in our own way. Tries to understand the meaning of morality. What does it mean. To live a moral life. And you try to understand in this very complicated modern world that we live in. What the words of the Bible me in today’s society. You are in school which tries to teach its students. How to behave with decency and with honesty and how you can best relates. To your fellow human beings and I applaud. You for trying to achieve those goals. Let me. Take a moment. Or a few moments. To tell you what motivates me. And the work that I do. As a public servant as a Sentinel. From the state of Vermont. And let me tell you that it goes without saying I am flaws foh throw me being a perfect human being. But all I am motivated by a vision.

Sentiment Analysis

Again, there are a variety of tools to do this, including the Natural Language Toolkit Project, a free python library. Taking advantage of a simple demo site which uses the NLTK, we can see that both Sanders and Trump are polar, but Sanders is more positive. Who would’ve known?

Trump

  • Overall: negative
  • Subjectivity
    • neutral: 0.2
    • polar: 0.8
  • Polarity
    • pos: 0.4
    • neg: 0.6

Sanders

  • Overall: positive
  • Subjectivity
    • neutral: 0.2
    • polar: 0.8
  • Polarity
    • pos: 0.8
    • neg: 0.2

For the adventuresome, here are more detailed instructions on using the NLTK for sentiment analysis.

03 Feb 2016, 17:51

Using Grunt to Manage Static Assets

I previously posted about using GNU Make to manage front-end assets for a website. A colleague suggested that I should check out Grunt as it does everything I need to do and more. So here it is.

I have the same goals as I did last week:

  • concatenate an arbitrary combination of js files, minifying them in the process
  • preprocess css with sass
  • copy directories i and lib untouched
  • run a watch process to update files as they’re changed

Installing grunt

Grunt is part of the node.js ecosystem, and as such is available via the node package manager (npm). Npm is available on OS X via Homebrew.

Basic npm concepts

There are a few things that we need to understand about npm. The biggest headache was recognizing the difference between local and global installs and knowing when to use which.

  • Npm installs packages into a project (unless the -g global option is specified, more on that later) and needs to be run in project root. Packages then go into a subdirectory called node_modules.
  • If you’re in some other directory when running npm, the packages will go into a node_modules subdirectory there and confuse you.
  • Npm expects to see a file called package.json in the project root directory and complains if it’s not there. package.json includes a list of packages that the project depends on, and the default npm install without any parameters installs those packages.
  • When installing a package explicitly, there is an option to add an entry to package.json so that someone else will be able to use npm install and get everything. Note that this is an option and not the default behavior.

Creating the package.json file

According to the documentation, the command to use is npm init, and it must be run in project root. Running it starts a dialog on the terminal, asking some mostly irrelevant questions: name (defaults to the name of the project directory), version (defaults to 1.0.0), description, entry point (defaults to index.js), test command, git repository, keywords, author, and license (defaults to ISC). These questions can be suppressed by using npm init --yes, which defaults everything.

Unfortunately, npm will complain if it doesn’t see a description, a repository field and a license field. The defaults only cover the license field, leaving the description blank and the repository field missing altogether.

The minimum package.json has just a name and a version. But since I’m a stickler for getting rid of warnings, I’m going to have to create my own package.json that includes name, version, description, repository and license. None of this information is relevant; its only purpose is to make the warnings go away.

{
  "name": "taco",
  "version": "1.0.0",
  "description": "xyz",
  "repository": {
    "type": "git",
    "url": "xyz"
  },
  "license": "ISC"
}

Unfortunately there’s one warning I can’t get rid of. At the time of this writing, npm install grunt produces this:

npm WARN deprecated lodash@0.9.2: lodash@<2.0.0 is no longer maintained. Upgrade to lodash@^3.0.0

According to the changelog for lodash, version 0.9.2 was released in 2012, and the current version is 4.0.0. Even the “upgrade to” version of 3.0.0 is a year old already. This is a red flag; how and why are these dependencies not getting maintained? That said, it appears that an update is on the way. Will have to ignore this warning for now.

Grunt plugins

Grunt itself is just the overlord; to do any real work we’re going to need some plugins. After a lot of googling, I’ve come up with this list:

  • To minify and combine javascript files, we can use grunt-contrib-uglify.
  • To compile scss into css, we can use grunt-contrib-sass.
  • To copy directories, we can use grunt-contrib-copy.
  • To delete old files, we can use grunt-contrib-clean.
  • To watch for changes and recompile, we can use grunt-contrib-watch.

All of these are marked as officially maintained, giving us the warm, fuzzy feeling that everything is going to work.

We can now install grunt and the plugins.

npm install grunt grunt-contrib-uglify grunt-contrib-sass grunt-contrib-copy grunt-contrib-clean grunt-contrib-watch --save-dev

Grunt command line

There is one more install required if we are to be able to run grunt from the command line. The package is grunt-cli, and needs to be installed globally so that the grunt executable goes into /usr/local/bin and is available in the system path.

npm install grunt-cli -g

It’s possible to install grunt-cli in the project directory, but then the executable will be in node_modules/.bin instead of /usr/local/bin, and that makes more headaches for us.

One gotcha is that the global grunt-cli requires a local grunt or it will fail. Grunt-cli is a wrapper that finds the locally installed grunt for whatever project you’re in; the global grunt-cli will not find a global grunt.

Summary of grunt installation

  • Install npm (e.g. brew install npm).
  • Create the package.json file shown above.
  • npm install grunt grunt-contrib-uglify grunt-contrib-sass grunt-contrib-copy grunt-contrib-clean grunt-contrib-watch --save-dev
  • npm install grunt-cli -g

package.json should go into source control, and node_modules should be excluded from source control with the appropriate entry in .gitignore.

Once we have package.json as updated by the npm install --save-dev command, steps 2 and 3 can be replaced by a simple npm install. We still need to keep step 4; global packages can’t go into package.json (npm will ignore --save-dev when -g is specified).

Optionally installing grunt-cli locally

Installing grunt-cli locally instead of globally will allow it to be included in package.json, but it has the side effect of not having the grunt executable in the path. A possible workaround to this side effect is to add a script section to package.json with all the grunts you want to do.

"scripts": { "watch": "grunt watch" }

Then you can type npm run watch instead of grunt watch. This may or may not be worth the trouble.

Writing a gruntfile

Basic gruntfile concept

The gruntfile is a bit of javascript initialization that gets run whenever grunt is invoked. The gruntfile needs to define an initialization function and assign that to the global module.exports. Within the initialization function, we’ll need to list the modules we need (grunt-contrib-uglify, etc.), specify some configuration for each module, define the default task, and optionally define additional tasks.

Each plugin defines a task of the same name as the plugin (e.g. grunt-contrib-uglify defines an “uglify” task, under which any number of subtasks may be defined).

The gruntfile is named Gruntfile.js and resides in project root. The basic gruntfile structure is:

module.exports = function(grunt) {
  grunt.initConfig({
    pluginname: { ... }  // one of these for each plugin
  });
  grunt.loadNpmTasks( ... );  // one of these for each plugin
  grunt.registerTask('default', ... );  // define the default behavior of `grunt` with no parameters
  grunt.registerTask( ... );  // optional additional tasks
}

Defining additional tasks is useful for combining tasks into a single command.

A thorough read of the docs along with some examples gives us enough information to build a single gruntfile, giving us the following commands:

  • grunt does a clean build, deleting pub if it exists and building everything from src.
  • grunt build does an incremental build of js and css files, updating only those files whose source has changed.
  • grunt copy syncs the directories i and lib from src to pub.
  • grunt watch runs until you kill it, watching for changes in src and updating pub as necessary.

Note that grunt is short for grunt all, which does grunt clean + grunt copy + grunt build.

Observations

  • Overall, the quality of documentation is poor. I had to resort to copying examples and then modifying them by trial and error until I got the results I wanted. There are many alternate syntaxes, causing further confusion.
  • Could not find a way to do incremental updates with uglify. The entire js collection is rebuilt whenever any js source file changes.
  • The sass plugin depends on having command-line sass installed as a ruby gem, a dependency that I grudgingly accepted when writing the previous makefile and was hoping to avoid.
  • Dependencies from @import statements in scss source files are handled nicely; the dependencies are honored when doing an incremental build and don’t need to be included in the gruntfile. This is nice.
  • The grunt-contrib-copy plugin doesn’t know how to sync. The i and lib directories are copied in their entireties every time there’s a change. There is another plugin which claims to know how to sync, but I haven’t tested it.

Conclusion

This was a whole lot of trouble to set up a relatively simple build system. Grunt is a powerful tool, and I can see the value of using it when you’re already in a node-based project, but using it as an isolated build tool is not worth the effort.

The only thing we gained with Grunt is the ability to auto-detect imports in .scss files and do incremental updates accordingly. At the same time we lost the ability to do incremental updates of the Javascript files, at least with the standard plugin.

I was also hoping to avoid the ruby sass dependency by using the plugin, but no luck there since the plugin is just a wrapper for the command line sass.

14 Jan 2016, 14:22

Static assets for websites

Count me in on the developers who believe that GNU make is the best tool for assembling static assets.

The general problem

We need to maintain a set of files B that is derived from another set of files A through some known (and possibly complicated) transformation. We edit the files in set A but not in set B. We would like a simple way to (1) create B from A, and (2) update B when A changes, only recreating the parts that are necessary.

The more specific problem

B is the set of static assets for a web service, and A is the set of source files used to make them. Only A will be checked into source control, and only B will be uploaded to the web server.

There are different kinds of assets in A that need to be treated differently.

Javascript

  • My Javascript source files are formatted nicely and full of meaningful, well-thought-out comments. I would like the js files sent with the web pages to be devoid of comments and mashed together so as to be almost unreadable. This can be accomplished by piping the files through JSMin on the way from A to B.

  • My Javascript source files are modular, and one page may need several files. These are best combined into one file for faster loading. Also, any source file could be included in several combination files. I would like the ability to have each js file in B created from an arbitrary combination of source files from A.

CSS

  • All my css is written as scss and needs to be processed with an scss compiler such as Sass. Scss files may import other scss files, a fact we need to be aware of when detecting changes.

Other assets such as images and precompiled libraries can be copied from A to B without modification.

What to do

The first thing is to define a directory structure.

For set A we’ll make a subdirectory src in project root with four subdirectories: js for Javascript sources, css for scss sources, i for image files, and lib for precompiled libraries.

For set B we’ll make a subdirectory pub in project root. Compiled js and css files will go directly in pub, and the two subdirectories i and lib will mirror src/i and src/lib.

.
├── src
│   ├── js
│   ├── css
│   ├── i
│   └── lib
└── pub
    ├── i
    └── lib

Next we need to make a list of the js and css files we would like generated and placed into pub. We’ll do that by defining variables JSFILES and CSSFILES, e.g.:

JSFILES := main.js eggs.js pancake.js
CSSFILES := blueberry.css yogurt.css

After that, we need to define the dependencies for each of these files, e.g.:

pub/main.js: src/js/main.js
pub/eggs.js: src/js/eggs.js src/js/milk.js
pub/pancake.js: src/js/milk.js src/js/flour.js src/js/eggs.js

pub/blueberry.css: src/css/blueberry.scss src/css/fruit.scss
pub/yogurt.css: src/css/yogurt.scss

To simplify things, we’ll define the default dependency to be one source file of the same name, so we can omit dependency definitions for main.js and yogurt.css. We’ll also define JS := src/js, CSS := src/css and PUB := pub.

$(PUB)/eggs.js: $(JS)/eggs.js $(JS)/milk.js
$(PUB)/pancake.js: $(JS)/milk.js $(JS)/flour.js $(JS)/eggs.js
$(PUB)/blueberry.css: $(CSS)/blueberry.scss $(CSS)/fruit.scss

Finally, we need to make a list of directories to be copied directly from src to pub.

COPYDIRS := lib i

This is now enough information for us to build a simple makefile, giving us (at least) the following commands:

  • make does a clean build, deleting pub if it exists and building everything from src.
  • make build does an incremental build of js and css files, updating only those files whose source has changed.
  • make copy syncs the directories i and lib from src to pub.
  • make watch runs until you kill it, watching for changes in src and updating pub as necessary.

Note that make is short for make all, which does make clean + make copy + make build.

How it works

The meat of this makefile is in the pattern rules (lines 43-55). Quick cheat sheet: $@ = target, $^ = all dependencies, $< = the first dependency. Details are here.

The first rule takes care of main.js and eggs.js.

The second rule takes care of pancake.js. Note that pancake.js doesn’t match the first rule because there is no source file called pancake.

The third rule takes care of blueberry.css and yogurt.css. Note that on line 55 fruit.scss is not supplied as an argument to sass. It’s only listed as a dependency because blueberry.scss contains an @import "fruit"; directive.

Finally, lines 32-36 take care of syncing directories i and lib.

In the end, our filesystem looks like this:

.
├── src
│   ├── js
│   │   ├── eggs.js
│   │   ├── flour.js
│   │   ├── main.js
│   │   └── milk.js
│   ├── css
│   │   ├── blueberry.scss
│   │   ├── fruit.scss
│   │   └── yogurt.scss
│   ├── i
│   │   ├── hanjan.jpg
│   │   └── ikant.png
│   └── lib
│       └── MooTools-Core-1.5.2-compressed.js
├── pub
│   ├── i
│   │   ├── hanjan.jpg
│   │   └── ikant.png
│   ├── lib
│   │   └── MooTools-Core-1.5.2-compressed.js
│   ├── blueberry.css
│   ├── eggs.js
│   ├── main.js
│   ├── pancake.js
│   └── yogurt.css
└── Makefile

Dependencies

This makefile requires jsmin, sass and watchman-make to be available at the command line.

Jsmin and Watchman (which includes watchman-make) are available on OS X via Homebrew. Sass is not (yet), but it can be installed as a system-wide ruby gem. I’m not a fan of requiring rubygems for my decidedly anti-rails build system, but since Sass runs nicely from the command line I’ll turn a blind eye for now.

Jsmin is also available via npm.

Other features I’d like to include

Would be nice to automatically detect @import statements in scss source files and generate dependency lists based on that. I’m aware that the Sass package has its own watcher that handles dependencies, but using that would mean bypassing a significant part of the makefile, thereby making a mess.

It would be pretty simple to add a make deploy command to rsync the server. I’ll probably do that later.

A feature I excluded on purpose

Many web frameworks automatically append timestamps or version numbers to static assets in order to defeat browser caching. This adds a whole lot of complexity for a pretty minor benefit. Once a site is in production, I expect updates to be few and far between, and I’m happy to manually add a version number to a target filename as necessary.

Credits

This Makefile was heavily influenced by and owes thanks to this blog post. Thank you!

08 Jan 2016, 15:50

Google Sign-in

Using Google Sign-in on Website X

Disclaimer: Read the docs too. This post doesn’t cover everything.

A week ago I was completely clueless as to how Google sign-in works. I set out to write about it and learned a few things.

Overview

Using Google sign-in on a website requires first doing the following in the Google developer’s console:

  • creating a project
  • creating a sign-in client ID for that project
  • associating the domain(s) of the website with the sign-in client ID

Sign-in is done using javascript on the web page to talk directly to Google’s servers. The javascript is loaded from Google’s servers. It is not necessary to involve the server for website X at all.

When Joe the Hacker attempts to sign in to website X, a popup dialog appears. The contents of the dialog depend on Joe’s current signed-in state.

If Joe is not signed in to Google at all, then a sign-in dialog appears. If he’s signed in to more than one account, then an account chooser dialog appears. If he’s signed into exactly one account, then the sign-in part is skipped.

If this is the first time he’s attempted to sign in to website X, then he’ll be asked to give permission for website X to have access to his profile information (name, picture) and email address.

In the case that Joe needs neither the sign-in dialog nor the permissions dialog (i.e. he’s already signed in to exactly one account and is a returning user), the pop-up closes itself immediately without any user action.

The browser remembers that Joe is signed in to website X using Google sign-in. He can sign out of website X and still be signed into Google. However, if he signs out of Google, then he’ll automatically be signed out of website X as well. He can’t be signed in to website X using his Google ID and not also be signed in to Google.

If the webpage making the sign-in call is served from a domain that has not been registered in the console, then Joe will see a 400 error (redirect_uri_mismatch) and a picture of a broken robot when trying to sign in. The error page also exposes the email address of the account that the project is made under.

Javascript details

The file platform.js provides the global Google API object called gapi and the auth2 module. The auth2 module must be explicitly loaded into gapi with the gapi.load method before it’s used. This method provides an optional callback for when/if the module is loaded successfully.

gapi.load("auth2", callback);

Once the module is loaded, it must be initialized with the sign-in client ID (see above). The client ID may either be provided as an option to the init method or in a meta tag in the document. The init function returns a GoogleAuth object.

gauth = gapi.auth2.init(options);

A logical initialization flow would be to have the initialization in the load callback.

gapi.load("auth2", function() { gapi.auth2.init(); });

The GoogleAuth object may also be obtained any time after it’s initialized using the getAuthInstance method.

gauth = gapi.auth2.getAuthInstance();

At this point we can find out whether Joe is or isn’t signed in to website X.

if (gauth.isSignedIn.get()) { alert("User is signed in, and I debug with alerts."); }

If he isn’t signed in, we can try to sign him in. This is done with GoogleAuth.signIn. In most cases, we should wait for him to click a button (e.g. “Sign In with Google”) before doing so. There are a few options that we can ignore for now.

gauth.signIn(options)

This function returns a Promise, making it easy to do stuff when it finishes.

gauth.signIn().then(function() {
    alert("We're in!");
}, function() {
    alert("You FAILED!");
});

There is also this thing called a GoogleUser object, which we can get after Joe has signed in. (We can also get it before he’s signed in, but it would be useless.) The GoogleUser object reveals Joe’s name, email address and profile picture if he has one.

user = gauth.currentUser.get();
profile = user.getBasicProfile();
alert(profile.getName() + " " + profile.getEmail());

Finally, you can use GoogleAuth.signOut to sign Joe out of website X.

gauth.signOut();

Although this signs him out, it doesn’t forget about him. The GoogleUser object is still available and still has his ID.

user = gauth.currentUser.get();
alert(user.getId());
alert(user.isSignedIn());    // false

And we can sign him in again using the grant method. This behaves slightly differently from GoogleAuth.signIn in that it doesn’t give Joe a chance to choose a different account.

user.grant();

According to the documentation, GoogleUser.signIn is equivalent to GoogleUser.grant, but my Firebug tells me that there’s in fact no method called signIn on the GoogleUser object. Bad documentation.

What usually happens in practice

Google provides a handy shortcut so that anyone can add Google sign-in to their page without knowing any javascript. As soon as platform.js is loaded, it checks for the existence of a div.g-signin2 (<div class="g-signin2">) and springs into action if found.

It loads the auth2 module and puts a standard sign-in button in the div (clobbering what was there, watch out), and wires up the button so that when you press it, it logs you in. It can also call a function of your choosing on sign-in, passing in Joe’s GoogleUser object.

The button appears to be rendered with gapi.signin2.render, and specifying options like this works as expected:

<div class="g-signin2" data-onsuccess="onSignIn" data-longtitle="true" data-width="200"></div>

One caution here is that data-longtitle is a string value that gets cast to a boolean, and, as such, the string “false” will get cast to true. The way to make longtitle false is to set data-longtitle="" (or to omit it entirely, as false is the default).

Involving the server for website X

Chances are website X will want to know about Joe’s sign-in on the server side. This will require some javascript to send a token to the server, which the server can then decode to get Joe’s information.

The token can be obtained from the GoogleUser object.

token = user.getAuthResponse().id_token

The token is signed to prevent spoofing, and it’s up to the server code to verify its integrity.
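
The post doesn’t show the server side; as a rough sketch only (Google’s tokeninfo endpoint is fine for development, but a production server should verify the token’s signature itself or use one of Google’s client libraries), the token can be checked and its audience compared with our own client ID:

package auth

import (
	"encoding/json"
	"errors"
	"net/http"
	"net/url"
)

// tokenInfo holds the few claims we care about from Google's response.
type tokenInfo struct {
	Aud   string `json:"aud"` // the client ID the token was issued for
	Sub   string `json:"sub"` // Google's stable user ID
	Email string `json:"email"`
}

// verifyIDToken asks Google's tokeninfo endpoint about an ID token and makes
// sure the token was issued for our client ID.
func verifyIDToken(idToken, clientID string) (*tokenInfo, error) {
	resp, err := http.Get(
		"https://www.googleapis.com/oauth2/v3/tokeninfo?id_token=" + url.QueryEscape(idToken))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, errors.New("token rejected")
	}
	var ti tokenInfo
	if err := json.NewDecoder(resp.Body).Decode(&ti); err != nil {
		return nil, err
	}
	if ti.Aud != clientID {
		return nil, errors.New("token was issued for a different client ID")
	}
	return &ti, nil
}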

Scopes

Scopes are the permissions that we’re asking Joe to give to website X. The default for sign-in is scopes profile and email, and we get Joe’s name, email, and picture if available. (Nerdy detail: the name and picture are not part of any scope; see below.)

With the default settings, Joe will see a request for the following permissions the first time he attempts to sign in to website X:

  1. Know who you are on Google
  2. View your email address
  3. View your basic profile info

If other permissions (e.g. calendar) are needed, they can be added with the scope option in gapi.auth2.init, the scope option in GoogleAuth.signIn, or the data-scope tag on div.g-signin2. Most scopes are expressed as urls, e.g. "https://www.googleapis.com/auth/calendar.readonly".

Here is a list of all known scopes.

Fewer Scopes

The default scopes can be turned off by sending fetch_basic_profile: false as one of the options to gapi.auth2.init. (Note that this precludes using the auto magic button.) In this case, at least one scope must be explicitly specified with the scope parameter.

The two scopes that are included by default are profile and email. Adding the profile scope does nothing. It only provides Joe’s 21-digit ID, which we get anyway. Adding the email scope gives us Joe’s email address, but the only way to get it is to decode id_token, as GoogleUser.getBasicProfile only works when fetch_basic_profile is true.

Strangely, requesting only the email scope causes permission requests #1 and #2 to be displayed, even though we only get the email access. Requesting only the profile scope causes only #3 to be displayed. Requesting both email and profile displays #2 and #3 but not #1. I have yet to understand the logic behind this behavior.

Also strange is the fact that neither the profile nor email scope provides Joe’s name and picture. As far as I can tell, the only way to get his name and picture is to stick to the default fetch_basic_profile: true.

05 Jan 2016, 11:45

Happy new year 11111100000

Or, happy new year (2⁶−1)(2⁵).

We just wrapped up a long-running big data project. While I didn’t find the project itself particularly interesting, I learned quite a bit of interesting stuff doing it.

On the skills side, I learned Go, learned how to navigate AWS, and gained a good understanding of how cluster data stores work.

But probably the most important takeaway from last year is summarized neatly in this blog post from 2010, recently sent to me by a coworker. In a nutshell, there are a lot of big data systems that are far larger and more complicated than they need to be for the data that they are designed to process.