I work with lots of environmental time series data from stationary instruments. This post describes why you should avoid mixing data and metadata in a single file and instead suggests an easy-to-implement, easy-to-use, maximally compact format consisting of two .csv files linked by unique identifiers.
The world of air quality data is undergoing rapid evolution. For decades, the EPA has managed data from a network of ~1200 regulatory grade (i.e. expensive and professionally maintained) air quality monitors that produce hourly measurements at specific locations. Aggregated “raw” data are made publicly available in hourly data files. These .csv files are well documented and adhere to the tidy data standard with one measurement per record. The EPA AQS is a fantastic example of a government agency consistently doing the hard work of data ingest, harmonization, quality control and aggregation to produce useful datasets in a format that anyone can work with and make sense of. Kudos to the EPA for the AQS data files!
One downside, however, is that the records in these files contain lots of repeated, location- and instrument-specific metadata resulting in large file sizes. For small collections of data, repeated metadata is a small price to pay for the convenience of having everything you want in a single file. But this price increases as the number of records goes up. For stationary time series with many records, this convenience quickly becomes unaffordable.
The 2020 wildfire season in the United States produced lots of smoke that impacted millions of people. This impact can be investigated using hourly PM2.5 measurements in the AQS data: https://aqs.epa.gov/aqsweb/airdata/hourly_88101_2020.zip
This file has 5,749,884 rows and is 44 MB compressed. (That seems OK. We should be able to work with 2-3 x 44 MB.) But when we unzip this file it suddenly blows up to 1.5 GB!
What is going on? A look at the first few lines tells us:
"State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... ...
Lots of repeated metadata!
For every instrument location, about 250 characters of metadata (=250 bytes) will be repeated 24 (hours) * 365 (days) = 8,760 times. This .csv file has over a Megabyte of unneeded, unwanted, repeated metadata for each of the 1200 instrument locations.
I have addressed this issue before in various workshops and meetings:
If you collect data repeatedly along any spatiotemporal axis (x-y-z-t), you need to keep data and metadata in separate files.
Mixing data and metadata is not just bad for time series data. It is bad for many types of data.
This example is a bit contrived, but imagine you have been tasked to convert the record of a conversation into a tidy .csv file for some natural language processing. Notes of the conversation might be formatted like this:
Jon (male, 60, Seattle, WA, scientist)
Jennifer (female 50, Boston, MA, nurse)
Sally (female, 40, Phoenix, AR, chef)
Jon — Welcome to our survey.
Sally — Glad to be here.
Jennifer — Me too.
Jon — How do you like our new data format?
Sally — I doubt I will ever use a csv file.
Jennifer — I think this data format has a problem.
Jon — What is the problem?
Jennifer — Too much repeated metadata.
…
I would NOT recommend storing an hour long conversation like this:
name,sex,age,city,state,profession,word_order,word Jon,male,60,Seattle,WA,scientist,1,welcome Jon,male,60,Seattle,WA,scientist,2,to Jon,male,60,Seattle,WA,scientist,3,our Jon,male,60,Seattle,WA,scientist,4,survey Sally,female,40,Phoenix,AR,chef,5,glad Sally,female,40,Phoenix,AR,chef,6,to Sally,female,40,Phoenix,AR,chef,7,be Sally,female,40,Phoenix,AR,chef,8,here Jennifer,female,50,Boston,MA,nurse,9,me Jennifer,female,50,Boston,MA,nurse,10,too Jon,male,60,Seattle,WA,scientist,11,how Jon,male,60,Seattle,WA,scientist,12,do ...
Although “tidy”, this format seems absurd because we all make the following assumptions:
We don’t need to repeat the speaker context with every word. A more understandable, more compact format for storing and sharing this data would be:
id,words Jon,"welcome to our survey" Sally,"glad to be here" Jennifer,"me too" ...
with detailed metadata about Jon, Sally and Jennifer in a separate table:
id,sex,age,city,state,profession Jon,male,60,Seattle,WA,scientist Sally,female,40,Phoenix,AR,chef Jennifer,female,50,Boston,MA,nurse
The atmospheric scientists who run weather models have been working with huge amounts of data since before time began (at “1970-01-01 00:00:00 UTC”). Weather model output files store data for every grid cell. A modern model will have many of these. The WRF 1.33km CANSAC grid contains 700x808x40 grid points for each time step. At 4 bytes per floating point number, that’s 90.5 MB per parameter per time step. A single model run will have dozens of parameters and 72 or more time steps. It’s easy to understand why this community developed the NetCDF data format and associated software libraries at a time when many of us were still playing Pong.
Just storing a single ASCII date stamp with every data record would increase data volumes by 700*808*40*(23 characters in “1970-01-01 00:00:00 UTC”) = 0.5 GB per timestep … for a single piece of repeated metadata!
Clearly, if you have LOTS of data, you need to adopt special formats and special software. Like it or not, this is what the future looks like for environmental time series data. The data deluge is well under way with the increased deployment of low-cost sensors producing data records every few minutes.
But we don’t have to embrace NetCDF quite yet and can still use ASCII .csv files for environmental time series as long as we:
Keep data and metadata in separate files.
The most efficient way to store environmental time series data is to put all of your measurement data into a data.csv
file with the first column containing a datestamp recording when measurements were made and all other columns containing those measurements. Column headers will contain unique identifiers for each “device-deployment”, aka unique time series.
Device-deployment metadata containing instrument and location information will go into a meta.csv
file with the unique identifiers in the first column and as much metadata as you want in the other columns. For a single device at a single location, the meta.csv
file will only have a single record. For multiple device-deployments all measuring the same parameter, meta.csv
will have one record for each instrument.
Here is a tiny example with hourly measurements from 4 air quality instruments:
# pm2.5_data.csv "datetime","id_1","id_2","id_3","id_4" 2023-09-16 18:00:00,0,3,2,5 2023-09-16 19:00:00,1,1,1,39 2023-09-16 20:00:00,3,1,1,9 2023-09-16 21:00:00,1,NA,1,7 2023-09-16 22:00:00,0,NA,1,8 ... # pm2.5_meta.csv "id","longitude","latitude","timezone","stateCode","countyName",... "id_1",-108.290764,37.350174,"America/Denver","CO","Montezuma",... "id_2",-112.041621,46.58432,"America/Denver","MT","Lewis and Clark",... "id_3",-114.894524,47.195087,"America/Denver","MT",NA,... "id_4",-122.094742,45.385597,"America/Los_Angeles","OR",NA,...
The data.csv
file can grow as needed without any bloat from repeated metadata. You can also add as many fields as you like to meta.csv
, knowing that this data will not be unnecessarily repeated. This structure is easy to generate, easy to work with and far smaller than an equivalent “tidy” structure with all metadata in every record.
Best of all, packages already exist in R to work with this structure!
A suite of R packages has been developed over the last decade to support efficient data processing and visualization of air quality and other environmental time series data. Packages AirMonitor, AirMonitorPlots, AirSensor2 and RAWSmet all use this data model and are built on top of the MazamaTimeSeries package which includes functions that mimic dplyr operations on a single table. Please see the detailed documentation and examples associated with each of these packages.
The advantages of structuring your data with separate data and metadata are pretty undeniable:
Best wishes for fast and efficient analysis of environmental time series data!
Continue reading: How NOT to format time series data