Photo by Rajeshwar Bachu on Unsplash

Loading API Data into Google BigQuery with Cloud Functions and Scheduler

Google BigQuery is a fantastic tool for SQL-based analysis of data — it has powerful integration with other Google Cloud services, and even better, the first 1 TB of query processing each month is free. In this article, we’ll look at one approach to loading API data into your Google Cloud Project ready for analysis, visualization, or simply archiving — as well as scheduling the task, so your data is always as up to date as it can be.

Phil Hather
6 min read · Feb 20, 2021

The prerequisites for this are that you have a Google account and you’ve created a Google Cloud Project to organise your data. If you don’t, this is very simple to do — a quick Google search will help. We’ll also assume that you have some knowledge of Python or another coding language for extracting the API data — although this step will be very simple, and example code will be provided.

A diagram of the functions used for creating the pipeline

Above is a diagram showing the workflow and the Google Cloud services we’ll be using. Don’t be put off by the number of services involved — it’s actually very simple to set up and works seamlessly.

Creating the Cloud Function

First things first, let’s find the API data that we want to analyse. For this example, I’ll be using data from the Fantasy Premier League (FPL) API, and this endpoint in particular:

https://fantasy.premierleague.com/api/bootstrap-static/

This data is in JSON format, which works well, but really, any data that can be parsed into a CSV or newline-delimited JSON will work.

Next, let’s write a Python script that extracts this into a Pandas dataframe. Pandas is a Python library that is very useful for processing data. We’ll also be using the Requests library to make the API request.

import requests
import pandas as pd
data = requests.get('https://fantasy.premierleague.com/api/bootstrap-static/')
json = data.json()
df = pd.DataFrame(json['teams'])

The above code simply loads the ‘teams’ section of the API response into a dataframe. As we want to store this data in Google Cloud Storage so it can be read into BigQuery, we’ll also need to use the Google Cloud Storage Python library. The convenient thing about running this code from within Google Cloud is that we don’t need to do anything to authorise access — that’s all taken care of for us.

from google.cloud import storage

client = storage.Client(project='example-project-123')
bucket = client.get_bucket('example-storage-bucket')

Simply replace the example placeholders above with your Cloud Project name, and the Storage bucket that you want to store the data within.

We’ll now need to turn the dataframe into a file within our Google Cloud Storage bucket, firstly by creating the file “blob” and then by populating it with the data in CSV format.

blob = bucket.blob('example_filename.csv')
blob.upload_from_string(df.to_csv(index=False), content_type='text/csv')

Lastly, let’s package all of those sections together as a function that can be conveniently triggered.

def api_to_gcs(url, endpoint, filename):
    data = requests.get(url)
    json = data.json()
    df = pd.DataFrame(json[endpoint])
    client = storage.Client(project='example-project-123')
    bucket = client.get_bucket('example-storage-bucket')
    blob = bucket.blob(filename)
    blob.upload_from_string(df.to_csv(index=False), content_type='text/csv')

As you can see, I’ve given this function some parameters so that the URL, endpoint, and filename can be changed as required. You might also find it convenient to add further parameters, such as the project or bucket name.
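As a rough sketch (this variation isn’t part of the original script, and the defaults are just the placeholders used earlier), the project and bucket could be passed in the same way:

def api_to_gcs(url, endpoint, filename,
               project='example-project-123',
               bucket_name='example-storage-bucket'):
    # Fetch the API response and load the chosen section into a dataframe
    df = pd.DataFrame(requests.get(url).json()[endpoint])
    # Write the dataframe to the chosen bucket as a CSV file
    client = storage.Client(project=project)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(filename)
    blob.upload_from_string(df.to_csv(index=False), content_type='text/csv')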

Now let’s create the Cloud Function that we’ll use to run this code. Navigate to the Cloud Functions dashboard and click Create Function.

Screenshot showing the first step of creating a Cloud Function

Name the function whatever you like and choose your region. Select Cloud Pub/Sub as the Trigger, and create a new Cloud Pub/Sub topic from the drop down menu — we’ll come back to this later.

Click Next to proceed to writing the code.

Screenshot showing editing the Cloud Function code

Select Python 3.8 from the Runtime dropdown list, and copy and paste your code into the “main.py” file. We’ll then need to define a wrapper function that accepts the default “data” and “context” arguments — don’t worry, we don’t need to use these anywhere in our script. In this example, I’ve called this function “main”, and it simply calls our api_to_gcs function. Set the “Entry point” box to the name of the “main” function, then navigate to “requirements.txt”.
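As a minimal sketch, the wrapper at the bottom of main.py (beneath the imports and the api_to_gcs function from earlier) could look something like this:

def main(data, context):
    # Entry point for the Cloud Function. The Pub/Sub message in 'data' and the
    # event metadata in 'context' aren't needed here, so they're simply ignored.
    api_to_gcs('https://fantasy.premierleague.com/api/bootstrap-static/',
               'teams', 'example_filename.csv')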

Screenshot showing requirements.txt Cloud Function file

The requirements.txt file simply needs to contain the names and versions of the packages that we’re using in the script — as shown in the screenshot above.
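If you’re unsure what to put in it, something along these lines should work for the script in this article. The version pins are only examples, so feel free to use whichever recent versions you prefer:

# Packages used by main.py (example versions)
requests==2.25.1
pandas==1.2.2
google-cloud-storage==1.36.0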

Now it’s time to deploy the Cloud Function!

Click the Deploy button and wait for your function to display the green tick next to it — you can now test it by navigating to “Test function” in the Actions menu. You can also see the logs through this menu — this is where you’ll see whether the function has run successfully after you’ve tested it.
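Alternatively, if you have the gcloud command-line tool installed, you can trigger the function by publishing a test message straight to its Pub/Sub topic. For example, assuming the topic you created is called example-topic:

gcloud pubsub topics publish example-topic --message="run"

A successful run will then show up in the function’s logs a few seconds later.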

Scheduling Updates with Cloud Scheduler and Pub/Sub

Now that we’ve created the function that imports the API data into Storage for us, let’s schedule it so that the data will update frequently.

Navigate to Cloud Scheduler and select Create Job.

Screenshot showing the Create Job screen of Cloud Scheduler

Give your Scheduler job a name, and set the frequency using unix-cron format. If you’re not familiar with this format, crontab.guru is a great website for generating these strings. I’ve set the job above to run every day at 4:00 AM GMT.
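For reference, “every day at 4:00 AM” in unix-cron format (minute, hour, day of month, month, day of week) is:

0 4 * * *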

Set the Target to Pub/Sub, and the Topic to the one that you created earlier with the Cloud Function — this is how we’re going to trigger the function. Pub/Sub (publish/subscribe) is a secure messaging service that allows our Google Cloud services to communicate — our scheduled job will publish a “message” to our Pub/Sub topic. Our function will be listening for this, and will run when the message is received.

That’s it!

Our function will now run at the defined schedule — importing API data to our Google Cloud Storage bucket.

Making Google Cloud Storage data available in BigQuery

Finally, let’s make the data we’ve imported available in BigQuery for analysis.

Navigate to BigQuery, select your Project and hit the “Create Dataset” button (skip this step if you already have a Dataset that you’d like to store this data in).

Once you’re in your Dataset, press the + button to create a new table.

Screenshot showing the Create Table screen within BigQuery

Select Google Cloud Storage from the “Create table from:” dropdown menu. Copy your Storage bucket location followed by your newly imported CSV file name into the “Select file” text box. Alternatively, if you would like the table to reference multiple files (as long as they’re in a uniform format), you can “glob” the files in the bucket with an asterisk — e.g. example_bucket/*.csv

Select your Project, Dataset, and new Table name in the boxes provided. Select “External table” from the Table type dropdown — this means BigQuery reads the data directly from Storage each time you query it, rather than keeping a static copy.

Screenshot showing the Create Table screen within BigQuery

I’d suggest setting the schema field names and types manually — the “Auto-detect” feature doesn’t always get it right. Lastly, under “Advanced options”, set the table to skip the header row of your file (set “Header rows to skip” to 1). Hit the button to create the table.
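Once the table is created, you can query it like any other BigQuery table. For example, assuming a dataset called example_dataset and a table called teams (both placeholder names for this sketch):

SELECT *
FROM `example-project-123.example_dataset.teams`
LIMIT 10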

Done!

You’ve now imported your API data, scheduled regular updates, and prepared the data for analysis in BigQuery.

I hope this article serves as a good starting point for creating this kind of simple data pipeline — let me know if you have any questions or comments.
