Fixing CASGEM data with OpenRefine

This is part 2 in a series of posts that chronicle our work with groundwater elevation data. Using California’s Department of Water Resources’s CASGEM system, we were able to retrieve 30 years worth of well readings for the entire state, but not without headache.

Since our last post Water Level Data from DWR, we were able to expose an unpublished ArcGIS REST endpoint. While this is not suitable for our bulk data download needs (only returns 1000 records max), it will come in handy when we need to retrieve updated entries from the CASGEM system.

Using Charles, a HTTP proxy, we managed to discover this resource:

A sample query for San Bernardino county

Or, to query the API using a GUI:

CASGEM Walkthrough

As mentioned in our last post, unfortunately, CASGEM’s bulk data download option does not provide latitude/longitude columns. Fortunately, the unique CASGEM_ID column is composed of the lat long coordinates mashed together. We will use a data transformation tool to generate separate lat long columns from the CASGEM_ID column. Data cleanup operations are not an unusual task when dealing with large datasets. We will use OpenRefine to make transformations and to structure our data for mapping and querying purposes.

OpenRefine (previously GoogleRefine), is a fast, powerful, and free data cleanup tool. There are many merits of OpenRefine, but I specifically enjoy the undo cue and being able to export your transformations to JSON. And did I say it’s fast.

With your .CSV file open, and after you cleanup the CASGEM_ID column name (mine imported with symbols in the column header), navigate to Edit cells > Transform…

CASGEM Walkthrough

Using GREL (Google Refine Expression Language), or OREL by the time your read this, we are going to split cells in the “CASGEM ID” column (while preserving the ID column) by targeting the “N” character as follows:


CASGEM Walkthrough

Now that we have two separate columns which are essentially the latitude and longitude derived from CASGEM ID, we need to format them appropriately. Clearly, “latitude” is missing a decimal point, and longitude is missing a “-” sign and its decimal point. Furthermore, we need to get rid of the “W001” at the end of the ID number. Click the dropdown in the newly created “lat” column and select Edit Cells>Transform. Furthermore, add try this GREL snippet as follows:

CASGEM Walkthrough

value.slice(0,2) + "." + value.slice(2,6)

CASGEM Walkthrough

Next, we need to format the longitude column “long” appropriately. A decimal point and minus sign are added:

"-" + value.slice(0,3) + "." + value.slice(3,7)

CASGEM Walkthrough

Ok, now we have our lat and long in separate columns. Our time series data is ready to map. OpenRefine is very handy when needing to transform data and perform cleaning over large datasets. Here, we did some very simple transformations. For other applications and work-flows check out this OpenRefine tutorial series:

Google Refine 2.0 – Introduction (1 of 3) (video version 2)

As mentioned above, DWR rate-limits its servers to 1000 records per query. If you are here to get data and spare yourself the grief, we will be publishing a bulk download option and data API shortly. Check back here, or keep an eye on announcements via our twitter: @CAWaterAtlas, NCWA blog, or newsletter.

Otherwise, for those that are looking to make this data useful, and need to customize the way the data is presented, I hope this post is helpful and clear. Please submit your questions and comments below. Look forward to your feedback.

This entry was posted in California Water Atlas. Bookmark the permalink.

Comments are closed.