Tuesday, August 14, 2012

Unindexed Entity properties with Bulk Loader

Bulk Loader is a really nice feature of Google App Engine. The only thing I could not find is how to upload unindexed properties.

Fortunately, there's a post_import_function hook that can be run on every entity just before it gets uploaded. The code is really just a couple lines:

(goes into something like postimport.py)

from google.appengine.api import datastore

_UNINDEXED_PROPS = {
  'MyModel': ['prop1', 'prop2', 'prop3'],
  'MyOtherModel': ['some_prop']
}

def unindex_properties(input_dict, entity, bulkload_state_copy):
  """Runs after every entity import and sets the correct
  unindexed properties.
  """
  if isinstance(entity, datastore.Entity):
    kind = entity.kind()
    if kind in _UNINDEXED_PROPS:
      props = _UNINDEXED_PROPS[kind]
      entity.set_unindexed_properties(props)
  return entity
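
To try the per-kind lookup without the App Engine SDK, here is a minimal Python 3 sketch. FakeEntity is a hypothetical stand-in for datastore.Entity that exists only for this illustration, so the isinstance check from the real hook is left out:

```python
# Hypothetical stand-in for google.appengine.api.datastore.Entity.
# It only mimics the two methods the hook touches.
class FakeEntity(dict):
    def __init__(self, kind):
        super().__init__()
        self._kind = kind
        self.unindexed = frozenset()

    def kind(self):
        return self._kind

    def set_unindexed_properties(self, props):
        self.unindexed = frozenset(props)


_UNINDEXED_PROPS = {
    'MyModel': ['prop1', 'prop2', 'prop3'],
    'MyOtherModel': ['some_prop'],
}


def unindex_properties(input_dict, entity, bulkload_state_copy):
    # Same lookup as the real hook, minus the isinstance check.
    props = _UNINDEXED_PROPS.get(entity.kind())
    if props:
        entity.set_unindexed_properties(props)
    return entity


listed = unindex_properties({}, FakeEntity('MyModel'), None)
unlisted = unindex_properties({}, FakeEntity('SomethingElse'), None)
print(sorted(listed.unindexed), sorted(unlisted.unindexed))
```

Kinds that are missing from the dictionary pass through untouched, which is why the hook can be attached to every transformer safely.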

And here's a sample bulkloader.yaml snippet:

python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.ext.db
- import: google.appengine.api.datastore
- import: google.appengine.api.users
- import: postimport

transformers:
- kind: MyModel
  connector: simplexml
  connector_options:
    xpath_to_nodes: /whatever
    style: element_centric
  post_import_function: postimport.unindex_properties
  property_map:
  - ...props definitions...


Gist: https://gist.github.com/3351222

Wednesday, May 23, 2012

Google APIs, Authentication and App Engine

Before you decide whether it's worth reading: this is a little overview on Google APIs and client libraries as notes to myself. Although this post is focused on Java, the concepts remain the same for equivalent libraries in Python.

Data format

So, when I say Google API, what does it mean exactly? A long time ago, in the dinosaur era, right after the Big Bang (UPDATE: how about String theory and Multiverse? :) just kidding), Google started exporting their awesome services to the world. Spreadsheet API, Calendar API, Picasa Web Albums API, just to name a few. Today, there's an API Directory page. Although it's a damn huge number of different services, most of them actually have a lot in common: the data format those APIs operate with. That's where the GData term comes from. The data were based on the Atom (XML) format and the data exchange protocol was called Google Data Protocol.

New APIs started appearing, but not all of them followed the GData convention. For instance, Prediction API uses the JSON format, and so do many others released recently. Meanwhile, "old" APIs, like Spreadsheets API, began supporting "alternative" formats (e.g. JSON, as opposed to AtomPub).

So, today Google Data Protocol is actually just a name indicating a data format related to Google (Data) APIs. On some pages you might see a disclaimer saying "Most newer Google APIs are not Google Data APIs". Generally speaking, it just means that "newer Google APIs" do not come from the age of AtomPub (XML); they started right off with alternative (e.g. JSON) formats.
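
To make the format difference concrete, here is a small Python sketch that reads the same title out of an Atom (XML) entry and out of a JSON object. Both payloads are made up for illustration and are not real responses from any Google API:

```python
import json
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical payloads: the same made-up entry in both formats.
atom_payload = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>tag:example.com,2012:entry-1</id>
  <title>My spreadsheet</title>
</entry>"""

json_payload = '{"id": "tag:example.com,2012:entry-1", "title": "My spreadsheet"}'


def title_from_atom(xml_text):
    # XML needs namespace-aware lookups to find the title element.
    root = ET.fromstring(xml_text)
    return root.find("atom:title", ATOM_NS).text


def title_from_json(json_text):
    # JSON maps straight onto a dictionary.
    return json.loads(json_text)["title"]


print(title_from_atom(atom_payload), "|", title_from_json(json_payload))
```

The information is the same either way; what changes is how much ceremony the client needs to get at it.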

API Usage (libraries)

So, how do you use a Google service programmatically? Normally you have two options:

  • read data format/protocol reference and implement the bits of the API you're interested in yourself (who said you need absolutely all of what a particular service can do for your specific needs?)
  • use a library already written by someone else.

In some cases it is simpler and faster to implement an API "client" yourself based on a published description of the data format/exchange protocol, especially if you're planning on using similar functionality offered by different service providers. This is what I did when I wrote SimpleAuth.


Now, say I want to access my Spreadsheets, Docs List and Picasa Albums programmatically. In this case I wouldn't go for a SimpleAuth approach but use something already written, an API client. Fortunately, there's a lot of good stuff out there. In fact, there are so many libraries that, at some point, I felt a little lost.

This brings me to the point of this post. I wanted to make a list of libraries that can help me connect with my data residing on Google's clouds. These are the two written by Google:

  • GData Java client
  • Google API Java client

So, what's the difference? Why do we have two libraries?

Well, GData client is where it started, from the times of the "older Google APIs". For instance, you'll find a Java class representing Atom feed Entry of GData protocol.

In contrast, Google API Java client is a newer library that has supported formats like JSON, the OAuth 2.0 protocol (more on that later) and Android from the start. This library was not written "from scratch" though. Instead, it is heavily based on two other libraries:

  • HTTP Client
  • OAuth Client

Which is the one for you, GData Client or Google API Client? It's a good question. In fact, there's a nicely written pros/cons wiki page called Migrating to Google API Java Client.

Let's talk about authentication and authorization now and get back to the two libraries in a minute.

Authentication and Authorization

Recall that, when you access a Google Spreadsheet doc or see a list of your documents on Google Drive using a browser, you are being asked for a password to access those resources on the web. That's authentication.

When you share a Google Doc with your friends or colleagues, you are "authorizing" them to access something that you've created. That's an example of authorization. You probably shouldn't divide authentication and authorization too strictly, though. One often comes with the other, hand in hand.

Now, this post is about accessing data programmatically, so there's another player on the scene: a web app that accesses data on behalf of:

  • a user (i.e. it impersonates that user), or
  • the application itself (more on that later)

This whole story, including the differences between the types of access, is nicely explained on the Google Accounts Authentication and Authorization page. But let's go back in time for a second and see where it all started.


In the Ice Age era Google supported different kinds of authentication and authorization for accessing their services via APIs:

  • ClientLogin
  • AuthSub
  • OAuth 1.0

Later, the OAuth 2.0 protocol was introduced as an alternative, which then started replacing the above three. OpenID is absolutely a valid authentication method, but it won't be of much use here since we want to access some Google services and data, which naturally involves authorization too.

Fast-forward. The above three, ClientLogin, AuthSub and OAuth 1.0, are now deprecated and OAuth 2.0 is officially the one. Actually, this is a great thing. A side effect, though, is that there are lots of code samples and libraries (written by Google devs and other people) out there using the now obsolete methods, and they'll take some time to update.

The rest of the post will focus on OAuth 2.0 bits.

Aside from being a much simpler protocol, OAuth 2.0 lets you not only "impersonate a user" but also have your web app act on behalf of, you know, itself, the application. That's what a Google Service Account is for; under the hood it authenticates with a signed JWT (JSON Web Token).
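
To give an idea of what that JWT looks like, here is a rough Python sketch that builds the part of the token that gets signed: a base64url-encoded header and claim set joined by a dot. The email, scope and token URL in the usage lines are placeholders I picked for illustration, so check the current service account docs for the real values. The RS256 signature itself is left out because it needs a crypto library and the account's private key:

```python
import base64
import json
import time


def b64url(data):
    # JWTs use URL-safe base64 with the trailing '=' padding removed.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwt_signing_input(issuer_email, scope, token_url, now=None):
    """Returns '<header>.<claims>', i.e. the string that would be signed."""
    now = int(time.time()) if now is None else now
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iss": issuer_email,   # who is asking: the service account
        "scope": scope,        # what it wants access to
        "aud": token_url,      # where the assertion will be sent
        "iat": now,            # issued at
        "exp": now + 3600,     # expires one hour later
    }
    parts = (header, claims)
    return ".".join(
        b64url(json.dumps(p, separators=(",", ":")).encode("utf-8"))
        for p in parts)


# Placeholder values, for illustration only.
signing_input = jwt_signing_input(
    "service-account@example.com",
    "https://spreadsheets.google.com/feeds",
    "https://accounts.google.com/o/oauth2/token",
    now=0)
print(signing_input)
```

A complete token appends a third dot-separated segment, the signature over this string, and is then exchanged for an access token.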

Why do you need a Service Account? Imagine your app needs to update events in a shared Google Calendar of your company. To do that, you'd need at least one user going through the OAuth 2.0 flow and authorizing your app so that it is able to access that calendar. What if that user goes away or unintentionally removes the authorization grant? Wouldn't it be much easier, and conceptually correct, to authorize the application itself to access that calendar's data? So, that's just one example.
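
For contrast, the user-driven flow starts by sending that one user to a consent page. Here is a minimal Python sketch of building such a URL with the standard OAuth 2.0 authorization request parameters. The endpoint, client ID, redirect URI and scope in the usage lines are assumptions for illustration, not values taken from this post:

```python
from urllib.parse import urlencode


def build_consent_url(auth_endpoint, client_id, redirect_uri, scopes):
    # Standard OAuth 2.0 authorization-code request parameters.
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),  # scopes are space-delimited
    }
    return auth_endpoint + "?" + urlencode(params)


# Hypothetical values, for illustration only.
consent_url = build_consent_url(
    "https://accounts.google.com/o/oauth2/auth",
    "my-client-id.apps.googleusercontent.com",
    "https://example.com/oauth2callback",
    ["https://www.googleapis.com/auth/calendar"])
print(consent_url)
```

Everything after this step depends on that user's grant staying in place, which is exactly the fragility a Service Account avoids.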

I'm getting to my final point.

Recall that we have two different libraries that essentially do the same thing: help you access data via APIs - GData client and Google API client.

GData client library started back when Google had, among other things, authorization mechanisms that are now deprecated. Google API client, naturally, focuses on OAuth 2.0 only, the latest and officially supported authorization/authentication protocol.

So, what if you're already using GData, the "older" library? Or, for some reason, cannot (or don't want to?) use Google API client, the newer library? Are you left with only the deprecated authentication protocols? Fortunately, no!

OAuth 2.0 support was added to the GData library just last month. So, no worries. Actually, they use the other two libraries I mentioned above to support OAuth 2.0 in GData client: HTTP Client and OAuth client.

I made this little diagram, trying to illustrate the library dependencies, auth-wise:


In the above diagram, GData client is using OAuth client library for authentication/authorization (OAuth 2.0), and calls Google's services as it normally would.

What does that mean on the practical side?

Imagine the following setup: I'm stuck with GData library but I don't want to use a deprecated authentication method. Also, I don't want to impersonate any user here. Instead, I'd like my app to act on behalf of itself, i.e. I want to use a Service Account.

Let's say I'm running my app on Google App Engine. Here's how I could initialize GData SpreadsheetService client:

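import java.util.Arrays;
import java.util.List;
import com.google.api.client.auth.oauth2.BearerToken;
import com.google.api.client.auth.oauth2.Credential;
import com.google.appengine.api.appidentity.AppIdentityService;
import com.google.appengine.api.appidentity.AppIdentityServiceFactory;
import com.google.gdata.client.spreadsheet.SpreadsheetService;

// Ask App Identity for an OAuth 2.0 token scoped to the Spreadsheets feeds.
List<String> scopes = Arrays.asList("https://spreadsheets.google.com/feeds");
AppIdentityService appIdentity = AppIdentityServiceFactory.getAppIdentityService();
AppIdentityService.GetAccessTokenResult accessToken = appIdentity.getAccessToken(scopes);

// Wrap the token in a Credential that sends it as a Bearer Authorization header.
Credential creds = new Credential(
  BearerToken.authorizationHeaderAccessMethod());
creds.setAccessToken(accessToken.getAccessToken());

SpreadsheetService ss = new SpreadsheetService("DBM4G-demo");
ss.setOAuth2Credentials(creds);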

Normally, instead of ss.setOAuth2Credentials() you would call something else, like setUserCredentials(). But, as a bonus of running my app on App Engine, I get this nice feature of App Identity service that manages OAuth 2.0 tokens for me.
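
On the wire, a bearer credential boils down to one extra HTTP header. Here is a minimal Python sketch of the equivalent request, built but never sent. The feed URL and the token value are placeholders for illustration:

```python
from urllib.request import Request


def authorized_request(url, access_token):
    # A bearer token travels in the Authorization header of each request.
    req = Request(url)
    req.add_header("Authorization", "Bearer " + access_token)
    return req


# Placeholder URL and token; nothing is sent over the network here.
req = authorized_request(
    "https://spreadsheets.google.com/feeds/spreadsheets/private/full",
    "hypothetical-access-token")
print(req.get_header("Authorization"))
```

The client library adds this header for you on every call, so you normally never build it by hand.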

How do you "share" a Google Spreadsheet with your app? Once you create the app on App Engine, head to the Application Settings page and look for Service Account Name:

It'll be something like an email address: <your-app-id>@appspot.gserviceaccount.com. Go to the sharing settings of a document, like you normally would when you want to share something with another person, and share the resource with the above service account email. Your app will then have access to that resource.

There's one caveat: the above code won't run locally on your dev server, because appIdentity.getAccessToken() won't be able to get you a valid token. Although you probably won't need a real access token while unit testing, hopefully it'll get fixed soon anyway.

On the other hand, if you're just starting, or creating an app for Android, probably the best way to go is Google API Client, the newer library. As a bonus, there's a nice Google API Client Eclipse plugin.