RGoogleStorage - Accessing the Google Storage API from R

Duncan Temple Lang

University of California at Davis

Department of Statistics


Getting Started

This is some initial documentation to describe how to get started with the RGoogleStorage package. It is incomplete and any comments, modifications or suggestions are welcome, including corrections to the code or additional tests.

One motivation for writing this package is to explore OAuth2.

The essential things we want to be able to do are

  1. create a bucket - makeBucket()

  2. list the contents of a bucket - listBucket()

  3. upload a file or content generally - upload()

  4. download a document - download()

  5. delete a document or bucket - removeBucket()

  6. query the access control list - getACL()

  7. change the access control list - setACL()

  8. copy an object - copy()

Getting Started

The first thing to do before using this package is to create some storage files on Google's Storage system. To do this, you have to register with the API. See https://code.google.com/apis/storage/docs/signup.html. The good news is that the storage is free until December 31, 2011. You may have to enable billing and supply credit card details.

After enabling access to the API, you can create content via a Web browser. Later, we will see how to do this in R.

Get Permissions and Authorization/Access

The next step is to register an application with Google so that you have an ID and secret that can be used to make requests to access other people's data. This is the interesting part of OAuth. We have a user, say Jane, and an application, say RGoogleStorage. Jane starts using the RGoogleStorage package which, for now, is the same as the RGoogleStorage application.

To use the package, first set the following two options in your R session

options(Google.storage.ID = '....',
        Google.storage.Secret = '...')

These are the project ID and the corresponding secret. You generate these by logging into Google's Console and clicking on the "API Access" item.

The easiest way to provide your project ID and secret is to set them as options, as above. The R functions will look for them there.

Don't put your ID and secret in R code that you give to others. These are tied to your Google account and if other people can see them, they can use them as if they were you.
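One way to keep the ID and secret out of shared code is to read them from environment variables at the start of a session. The following is a minimal sketch using base R only. The helper setGoogleCredentials() and the environment variable names GOOGLE_STORAGE_ID and GOOGLE_STORAGE_SECRET are hypothetical and are not part of the package; only the two option names come from the text above.

```r
# Hypothetical helper (not part of RGoogleStorage): copy credentials from
# environment variables into the options described above.
# The environment variable names are assumptions for illustration.
setGoogleCredentials =
function(idVar = "GOOGLE_STORAGE_ID", secretVar = "GOOGLE_STORAGE_SECRET")
{
    id = Sys.getenv(idVar, unset = NA)
    secret = Sys.getenv(secretVar, unset = NA)

    # Fail early rather than setting empty credentials.
    if(is.na(id) || is.na(secret))
        stop("set the environment variables ", idVar, " and ", secretVar)

    options(Google.storage.ID = id, Google.storage.Secret = secret)
    invisible(TRUE)
}
```

The environment variables can be defined outside of any script you share, for example in a personal .Renviron file, so that the script itself contains only the call setGoogleCredentials().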

To use the functions in the RGoogleStorage package, you need to give the R code permission to access your Google Storage buckets. To do this, you call the function getPermission(). You specify whether you want to grant read, write or full permissions. The default is 'read'. You can also specify the ID of the client in the call, but again this is best set as the Google.storage.ID option in R. Calling getPermission() will display a Web page in your Web browser that prompts you to either grant or deny access for the client application. We'll assume you grant permission; the Web browser will then show you another page which contains a string of characters. Copy this string and paste it at the R prompt:

Cut-and-paste the permission string from your Web browser here:

This will turn the string into an explicit OAuth2PermissionToken object and you can assign that to an R variable for future use. So

token = getPermission()

gives us the permission token.

Alternatively, you can call getPermission() with ask set to FALSE and then either assign the string from the Web browser to a variable yourself or create the token explicitly with

token = permissionToken('4/eqN7LMFoapQfVCCBr5PfVG6JKffG')

Now we have the permission token. We convert this into an authorization token with getAuth():

auth = getAuth(token)

We can also use

auth = as(token, "OAuth2AuthorizatonToken")

We use this authorization token object in our calls to the primary functions of the storage API.

Let's start by creating a bucket with makeBucket():

makeBucket(auth, "proj1")

We can find out what it contains with listBucket():

listBucket(auth, "proj1")
<0 rows> (or 0-length row.names)

So, as we expect, this is empty.

Next, let's upload some content and put it into an object in this bucket, say foo:

upload(auth, "proj1/foo", I("This is text"), "text/text")

Here we specify the content as an in-memory string. We tell upload() that this is not the name of a file by specifying that this is of class AsIs via the I() function. We also specify that the content is text. Google storage remembers this and will tell us when we try to download the content. If we don't specify this, Google will have to assume the content is an unknown binary format.
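The distinction between literal content and a file name relies on the class that I() adds to its argument. The sketch below shows, in base R only, how a function can tell the two apart. describeContent() is a hypothetical illustration of the idea and is not the code used by upload().

```r
# Hypothetical illustration: dispatch on the "AsIs" class added by I().
describeContent =
function(x)
{
    if(inherits(x, "AsIs"))
        "literal content"
    else
        "name of a file"
}

class(I("This is text"))              # includes "AsIs"
describeContent(I("This is text"))    # "literal content"
describeContent("foo.png")            # "name of a file"
```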

If you are going to be making multiple requests, it is advisable to explicitly compute the authorization token with

auth = getAuth(token)

This can then be passed to each of the functions rather than the permission token. This avoids an extra request in each function call to obtain the authorization. Otherwise, you would have to get a new permission token for each request.

We can also upload binary data, and we can upload the contents of a file. Let's create a PNG plot:

png("foo.png")
plot(1:10, main = "Silly plot")
dev.off()

upload(auth, "proj1/myPlot", "foo.png", "image/png")

Now let's list the contents of our bucket:

b = listBucket(auth, "proj1")
b
     Key        LastModified                               ETag  Size StorageClass
1    foo 2011-05-18 13:00:20 "5578833a0c6cb26394a1414140718cab"    12     STANDARD
2 myPlot 2011-05-18 12:59:26 "0e49b507686b4ad978ef53832c11c157" 14184     STANDARD
                                                                     Owner
1 00b4903a97f8e92544e2b4ed4781d9c280240153f95b068940955770edc6aaeeduncantl
2 00b4903a97f8e92544e2b4ed4781d9c280240153f95b068940955770edc6aaeeduncantl
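Since the listing is an ordinary data frame, the usual R tools apply to it. The following sketch rebuilds part of the listing above by hand, so that it runs without a network connection, and summarizes it. The choice of UTC as the time zone and the omission of the ETag, StorageClass and Owner columns are assumptions made only to keep the example short.

```r
# Hand-built stand-in for the result of listBucket(), using values from
# the listing above. The time zone is an assumption.
b = data.frame(Key = c("foo", "myPlot"),
               LastModified = as.POSIXct(c("2011-05-18 13:00:20",
                                           "2011-05-18 12:59:26"), tz = "UTC"),
               Size = c(12, 14184),
               stringsAsFactors = FALSE)

sum(b$Size)                     # total number of bytes stored: 14196
b$Key[order(b$LastModified)]    # oldest first: "myPlot" "foo"
```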

So now let's retrieve one of our objects:

tt = download(auth, "proj1/foo")
tt
[1] "This is text"
attr(,"Content-Type")
            
"text/text" 

When we get the plot,

tt = download(auth, "proj1/myPlot")

this returns a raw vector and the Content-Type attribute is "image/png", as we specified.
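A raw vector like this can be written back to a local file with writeBin(). The sketch below uses a stand-in for the downloaded value so that it runs without access to Google Storage: the bytes are only the 8-byte PNG signature, not a complete image, and the mapping from content type to file extension is an assumption for illustration.

```r
# Stand-in for the value returned by download(): the 8-byte PNG signature
# with the Content-Type attribute described above. Not a complete image.
tt = as.raw(c(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A))
attr(tt, "Content-Type") = "image/png"

# Choose a file extension from the content type (illustrative mapping).
ext = switch(attr(tt, "Content-Type"),
             "image/png" = ".png",
             "text/text" = ".txt",
             ".bin")
out = tempfile(fileext = ext)

# writeBin() requires a plain vector, so drop the attribute first.
writeBin(as.vector(tt), out)

file.info(out)$size    # number of bytes written: 8
```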

Moving and Removing Objects

We can copy one object to another name with copy():

copy(auth, "proj1/foo", "proj1/bar")

We can remove an object or a bucket with (the misnamed) removeBucket(), e.g.

removeBucket(auth, "proj1/foo")

We can verify it has been removed with

listBucket(auth, "proj1")

and checking the names.
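The check can also be done programmatically on the Key column of the listing shown earlier. In this sketch, b is a hand-built stand-in for the result of listBucket() after the copy and removal above, so that the code runs without a network connection. The Size given for bar assumes the copy has the same size as foo.

```r
# Hand-built stand-in for listBucket(auth, "proj1") after foo has been
# copied to bar and then removed.
b = data.frame(Key = c("bar", "myPlot"),
               Size = c(12, 14184),
               stringsAsFactors = FALSE)

"foo" %in% b$Key    # FALSE: the object has been removed
"bar" %in% b$Key    # TRUE: the copy is still there
```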

Refreshing a Token

The object returned by getAuth() is valid for one hour. After that, it expires. However, we can refresh/renew it. We use refreshToken() to do this:

auth = refreshToken(auth)

By default, this checks the expiration time against the current time and only refreshes the token if necessary. One can specify force = TRUE to force the refreshing request.
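The default behaviour described above amounts to comparing a stored expiration time with the current time. The following base R sketch illustrates that check. needsRefresh() and its arguments are hypothetical and are not the package's implementation; the timestamps are made up for the example.

```r
# Hypothetical illustration of the expiration check: refresh when forced
# or when the current time has reached the expiration time.
needsRefresh =
function(expires, now = Sys.time(), force = FALSE)
{
    force || now >= expires
}

issued = as.POSIXct("2011-05-18 13:00:00", tz = "UTC")
expires = issued + 60 * 60     # valid for one hour (in seconds)

needsRefresh(expires, now = issued + 30 * 60)                  # FALSE
needsRefresh(expires, now = issued + 61 * 60)                  # TRUE
needsRefresh(expires, now = issued + 30 * 60, force = TRUE)    # TRUE
```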

Controlling Access

Each bucket and object in a bucket has an access control list (ACL) limiting who has permission to read, change or delete the object. We can query the ACL for an object using getACL(). This returns a data frame giving a) the "scope", i.e. the entity to whom the permission applies, e.g. Google user ID, Google user email, Google Group ID or email, AllUsers or AllAuthenticatedUsers, b) the type of the scope (how to interpret the scope value), c) the permission value, and d) the name of the scope. We also get the identity of the owner. For example,

getACL(auth, "proj1/myPlot")
Object of class "ACLDataFrame"
    permission                                                                    scope     name     type
1 FULL_CONTROL 00b4903a97f8e92544e2b4ed4781d9c280240153f95b068940955770edc6aaeeduncantl duncantl UserById
Slot "owner":
                                                          duncantl 
"00b4903a97f8e92544e2b4ed4781d9c280240153f95b068940955770edc6aaee" 

When uploading content via upload(), we can specify a simple access setting that applies to a broad group of potential accessors. We can use any of 'project-private', 'private', 'public-read', 'public-read-write', 'authenticated-read', 'bucket-owner-read', 'bucket-owner-full-control' as the value for the access parameter.

Generally, we want to have more specific control over the permissions we grant. We want to be able to declare that user A can read an object, group B can write an object, or all users can read an object. We also may want to grant some entities full control of an object or bucket. To specify this, we use setACL(). We first have to build the access list. In common cases, we can specify this as a simple list of the form

list(id = 'permission', id = 'permission', ...)

where id is a Google user ID or email or Google group ID or email, or AllUsers or AllAuthenticatedUsers and permission is one of 'read', 'write', 'full'. So

setACL(auth, "proj1/myPlot", 
          list('bob@gmail.com' = 'read',
               'duncan@gmail.com' = 'full'))

Note that this function is as yet untested.
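When the IDs and permissions come from data rather than being typed by hand, the access list can be assembled and checked before it is passed on. The following is a minimal sketch in base R. makeAccessList() and validPermissions are hypothetical and are not part of the package; the set of accepted permission values is taken from the description above and may not match what setACL() actually accepts.

```r
# Permission values as listed in the text above (an assumption about
# what setACL() accepts).
validPermissions = c("read", "write", "full")

# Hypothetical helper: build a named list of the form
# list(id = 'permission', ...) and reject unknown permission values.
makeAccessList =
function(ids, permissions)
{
    stopifnot(length(ids) == length(permissions))

    bad = setdiff(permissions, validPermissions)
    if(length(bad))
        stop("unrecognized permission(s): ", paste(bad, collapse = ", "))

    structure(as.list(permissions), names = ids)
}

acl = makeAccessList(c("bob@gmail.com", "AllUsers"), c("write", "read"))
str(acl)
```

The resulting list has the same form as the one written by hand above and could be given as the third argument in a call to setACL().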