View on GitHub

Google Maps Geocode Enrichment

This project repository provides a headless module to enrich location data in a database table using the Google Maps Geocode API.

Google Maps Geocode Enrichment

This project repository provides a headless module to enrich location data in a database table using the Google Maps Geocode API.

Table of Contents

Background

This application module was built to provide data cleansing and enrichment for physical mailing address locations scraped from public data sources. Street address locations can contain many typos, variations in spellings, misspellings and other problems. The Google Maps Geocode Service was selected as a source of data enhancement and enrichment because Google Maps is one of the best sources of location-based addresses across the world and because of the advanced Natural Language Prediction (NLP) capabilities inherent in using this service for address resolution.

The original design of this data enrichment module leveraged the following outputs from the Geocoding API:

  1. Latitude / Longitude coorindates for a street address
  2. Standardized cleansed street address label
  3. Tokenized pieces of the street address for additional filtering, analysis and grouping (i.e. parse “NC” from and address in North Carolina and store in a “State” column)
  4. Google Place ID (unique identifier for every location in the world)

The scope of this module is intended to be a headless engine running on an as-needed basis to update records in a database table as a part of a broader data management ETL pipeline. Example use cases include data warehousing and data lakes.

Technical Overview

This project repository provides source code to build a process to enrich address locations via the Google Geocode API. The source code is in python and is containerized with Docker. This serves to address 2 main concerns. First that it is written in a language that is approachable for data analysis purposes, and second it’s efficient and fast enough to provide ~500,000 updates in 1 day. Parallelization of the Google API interaction step is needed to meet the latter criteria, and containerization serves as a useful means to accomplish that through scaling features of Docker.

The following conceptual diagram highlights where this application fits into a broader data management strategy. Overview

Google provides explicit terms of use for their products. This application leverages the Geocoding API service which has policies to be considered for your intended use of the data retrieved. Sections related to pre-fetching and caching of content apply directly to this application. Google explicitly states that data from the Geocoding API must not be pre-fetched, cached, or stored except under limited conditions.

The author of this project repository and application makes no legal recommendations or interpretations on how to adhere to Google Geocoding API policies for applications developed commercially or contained within your on-premise environment. Use at your own discretion.

Project Directory Structure

Key components are organized into the following directory structure:

Prerequisites

Use of this application requires working knowledge of the following technologies:

Technical Prerequisites

Google Maps Geocode API Credentials

This application requires an API key to query the Google Maps Geocode API. You must have the Geocoding API enabled from within the Google Cloud console as well as a functioning API key.

The API key is stored in the docker-compose.yml file which builds and runs the docker application as described in Step 5 of the Deployment Instructions.

Docker Requirements

This application requires the latest version of Docker installed and running.

This application leverages a Docker network to containerize network activity. docker-compose.yml files within /cluster reference the geocode network.

docker network create geocode

Python Requirements

This application requires the latest version of Python installed.

PostgreSQL Requirements

This application was built using PostgreSQL version 11.x and currently only supports PostgreSQL. Any version of PostgreSQL should work. This application uses standard SQL table generation, reads and updates. Define the schema search path for the PostgreSQL database user to prevent communication problems. Schema-specific references are not defined within this project repository.

A containerized version of PostgreSQL is provided to help understand the application in your local environment. Step 2 of the Deployment Instructions provides further information on deploying the sample PostgreSQL docker container.

PostgreSQL Database Connection

By default this application uses a containerized version of PostgreSQL deployed locally to provide an easy interface for development and testing. Database credentials are stored in the docker-compose.yml file located within /cluster/app.

Further context on database credentials can be found in Step 5 of the Deployment Instructions.

Define PostgreSQL User Search Schema Path

The following command defines the default search path for a given PostgreSQL database user. The following command must be executed within the database.

SET search_path TO <your schema>;

Further information can be found in the PostgreSQL Schema documentation.

Technical Component Overview

The following list provides an overview of key components:

Dependencies

PostgreSQL and Kafka must be running before the enrichment process begins. The geocode_lib Docker image must be built as well. The enrichment process leverages these in a service-oriented manner.

Parallelized Enrichment Process

Queries to Google must be parallelized; to accomplish this, a single client (pg_query) interacts with postgres, places the query on a kafka topic geocode_input, multiple clients (g_query) take these queries in parallel and convert the result to a format usable by the to-progres process and place it in a result topic geocode_output. And finally a single process (pg_update) reads the contents of that output topic and updates the database with the results.

The rate limiting is controlled by pg_query.

Scaling Performance

The enrichment process can be scaled in parallel using the g_query parameter described in Step 5 of the Deployment Instructions. Suggested values are provided below.

Ensure Kafka partitions are defined appropriately as described in the Kafka Partitions section. This application was originally built and operational on a small machine with only 4 vCPUs running 110 parallel threads (g_query=110).

Kafka Partitions

To allow the parallelism concept to work, the geocode_input Kafka topic must have a number of partitions greater or equal to the number of g_query clients.

Enter the running Kafka Docker container.

docker exec -it infra_kafka_1 /bin/bash

Set partitions on a new topic

kafka-topics.sh --create --zookeeper zookeeper --topic geocode_input --partitions 50 --replication-factor 1

Or if the topic already exists

kafka-topics.sh --alter --zookeeper zookeeper --topic geocode_input --partitions 50

Additional documentation can be found on the wurstmeister Kafka Github.

Deployment Instructions

The following steps can be used to build a local environment for development and testing purposes or a production environment for data enrichment as a part of an ETL data flow process.

  1. Define Docker Network
  docker network create geocode

If the network already exists, Docker will display a message.

  1. Build and Start PostgreSQL Database Container (if desired) for development and testing purposes only. Production data management flows will most likely have other PostgreSQL database sources which can be configured within the docker-compose.yml file. Further instructions provide steps to configure this file.

Switch to Docker image definition from the project root.

  cd test/postgres_db

Build PostgreSQL Docker image using the Dockerfile.

  docker build -t postgres .

Start the PostgreSQL Container from the image named postgres created previously. The --network parameter geocode references the Docker network created previously. The -p parameter maps the default PostgreSQL port of 5432 from the Docker network to your local network. Create your own password using the POSTGRES_PASSWORD parameter.

  docker run --name postgres --network geocode -p 5432:5432/tcp -e POSTGRES_PASSWORD=YOUR_PASSWORD -d postgres

Connect to the PostgreSQL database using a tool such as pgAdmin to test the database connectivity and explore the sample data contained within the ref_location table. The sample database created in this process is the latest version of PostgreSQL with a database called postgres, a user named postgres and a password set when running the Docker container.

  1. Build Geocode Library Docker Image

Switch to Docker image definition from the project root.

  cd lib

Build geocode_lib Docker image using the Dockerfile.

  docker build -t geocode_lib .

The geocode_lib Docker image is built using python code to help translate data between the Geocode API and the PostgreSQL target database.

  1. Build and Deploy Kafka Infrastructure

Switch to Docker image definition from the project root.

  cd cluster/infra

Build and start the Kafka infrastructure using docker-compose.

  docker-compose up -d

The docker-compose command above builds and runs the Docker container in the background.

  1. Build Apps for retrieving locations to enrich, querying the geoode API and updating location data in target database.

Switch to Docker image definition from the project root.

  cd cluster/app

Build and start the application using docker-compose.

  docker-compose up -d --scale g_query=5

The docker-compose command above builds and runs the Docker container in the background.

ref_location Table

The ref_location table is used to store location information and results from this enrichment application process retrieved from the Google Maps Geocode API. This section describes the physical structure of the table as well as intended usage and data flow expectations. DDL code is provided to build this table along with useful indexes and constraints for PostgreSQL in ./doc/ref_location_ddl.sql. This application was built using PostgreSQL and currently only supports PostgreSQL.

Physical Table Definition

The following table describes each column in the ref_location table. Data types align with PostgreSQL supported data types. Data flow describes where the record value originates for each column. Record values loaded into ref_location via standard ETL data flows are marked with the acronym ETL. These are typically required fields used by this application module to query the Google Maps Geocode API service or for internal record management purposes. Record values updated by this enrichment application module are marked with the term App or Geocode API (App). Columns marked with Geocode API (App) are populated from Google Maps, through the Geocode API, managed by this application.

Column Name Data Type Description Required? Data Flow
location_hash text Unique hash or natural key of address location. Primary key constraint. Yes ETL
location text Raw address location either as an atomic-level street address or simple city / state. Yes ETL
enrichment_enabled boolean True/false value indicating if location enrichment using this application is enabled for record. Yes ETL
enrichment_status text Simple description of the enrichment process status (IN_PROCESS, COMPLETE, or null)   App
formatted_address text Standardized and cleansed address   Geocode API (App)
latitude numeric(11,6) Latitude coordinate of location.   Geocode API (App)
longitude numeric(11,6) Longitude coordinate of location.   Geocode API (App)
load_dttm timestamp When the record was loaded into this table.   App
last_update_dttm timestamp When the record was last updated.   App
google_place_id text Unique place identifier provided by Google Maps.   Geocode API (App)
google_partial_match boolean Indicates if the raw location returns a partial match from the Google Maps Geocode API.   Geocode API (App)
google_result_count integer Number of addresses resolved by the Google Maps Geocode API.   Geocode API (App)
google_result_type jsonb JSON string listing of the possible address types.   Geocode API (App)
long_street_number text Full text description. See address types and components.   Geocode API (App)
long_route text Full text description. See address types and components.   Geocode API (App)
long_political text Full text description. See address types and components.   Geocode API (App)
long_locality text Full text description. See address types and components.   Geocode API (App)
long_postal_code text Full text description. See address types and components.   Geocode API (App)
long_postal_code_suffix text Full text description. See address types and components.   Geocode API (App)
long_postal_town text Full text description. See address types and components.   Geocode API (App)
long_premise text Full text description. See address types and components.   Geocode API (App)
long_country text Full text description. See address types and components.   Geocode API (App)
long_admin_area_level_1 text Full text description. See address types and components.   Geocode API (App)
long_admin_area_level_2 text Full text description. See address types and components.   Geocode API (App)
long_admin_area_level_3 text Full text description. See address types and components.   Geocode API (App)
long_admin_area_level_4 text Full text description. See address types and components.   Geocode API (App)
long_admin_area_level_5 text Full text description. See address types and components.   Geocode API (App)
long_establishment text Full text description. See address types and components.   Geocode API (App)
short_street_number text Abbreviated description. See address types and components.   Geocode API (App)
short_route text Abbreviated description. See address types and components.   Geocode API (App)
short_political text Abbreviated description. See address types and components.   Geocode API (App)
short_locality text Abbreviated description. See address types and components.   Geocode API (App)
short_postal_code text Abbreviated description. See address types and components.   Geocode API (App)
short_postal_code_suffix text Abbreviated description. See address types and components.   Geocode API (App)
short_postal_town text Abbreviated description. See address types and components.   Geocode API (App)
short_premise text Abbreviated description. See address types and components.   Geocode API (App)
short_country text Abbreviated description. See address types and components.   Geocode API (App)
short_admin_area_level_1 text Abbreviated description. See address types and components.   Geocode API (App)
short_admin_area_level_2 text Abbreviated description. See address types and components.   Geocode API (App)
short_admin_area_level_3 text Abbreviated description. See address types and components.   Geocode API (App)
short_admin_area_level_4 text Abbreviated description. See address types and components.   Geocode API (App)
short_admin_area_level_5 text Abbreviated description. See address types and components.   Geocode API (App)
short_establishment text Abbreviated description. See address types and components.   Geocode API (App)
location_type jsonb Address type. See address types and components.   Geocode API (App)
enrichment_api_response jsonb Full API JSON response provided by Geocode API service.   Geocode API (App)

Additional documentation on geocode data output can be found on the Google Maps Platform Documentation.

Logical Data Flow

The ref_location table is designed to be an independent source and target to write any type of address location typically entered into Google Maps or any street address location service or GIS tool. Addresses may be partial, limited or only cover broad cities or states. The Google Maps Geocode API will provide a value regardless and this application enrichment process will ingest and update the ref_location table accordingly. PostgreSQL is a high performance transactional database which can manage many simultaneous reads, writes and updates. The location_hash value must be pre-computed prior to inserting into the ref_location table. This can be something simple such as an MD5 hash of the location string. Assuming this enrichment process is running, as location records are inserted into ref_location they will be updated in near real time or as the queue permits.

Common Test Examples

Interactive Geocode API JSON Response from Web Browser

To simulate the Geocode API interactively from a web browser and visually understand the JSON response, use the following URL structure with your API key:

https://maps.googleapis.com/maps/api/geocode/json?address=1030%20Richardson%20Dr,%20Raleigh,%20NC%2027603&key=ENTER_YOUR_API_KEY

Required Monitoring and Troubleshooting

Volume Disk Space

As the application threads into many parallel queries disk storage can become over leveraged. Monitor disk utilization using the df -h command from the Docker host machine. If disk storage becomes a problem, there could be orphaned disk volumes. Stop all services using the docker-compose down command and use the following command to clear dangling docker volumes.

docker volume rm $(docker volume ls -qf dangling=true)

Logs

Monitor Docker logs using the following commands.

List docker containers.

docker container ls

View logs from a container with an exit status.

docker logs <name of container that exited >

Useful Docker Commands

List docker networks: docker network ls

Display useful information about a docker network: docker network inspect <network name>

List all docker images: docker image ls

List all active and running containers: docker ps

List all active (and inactive) containers: docker ps -a

Print docker log for troubleshooting: docker logs app_pg_update_1

Remove images individually: docker rmi <image ID>

Standards

File Properties

Coding Standards

Author Contact