This notebook contains an excerpt from Python Programming and Numerical Methods - A Guide for Engineers and Scientists. The content is also available at Berkeley Python Numerical Methods.

The copyright of the book belongs to Elsevier. We also have this interactive book online for a better learning experience. The code is released under the MIT license. If you find this content useful, please consider supporting the work on Elsevier or Amazon!


HDF5 Files

In scientific computing, we sometimes need to store large amounts of data with quick access, and the file formats we introduced before are not going to cut it. You will soon find that in many such cases, HDF5 (Hierarchical Data Format) is the solution. It is a powerful binary data format with no limit on the file size. It provides parallel IO (input/output) and carries out a bunch of low-level optimizations under the hood to make queries faster and storage requirements smaller.

An HDF5 file saves two types of objects: datasets, which are array-like collections of data (like NumPy arrays), and groups, which are folder-like containers that hold datasets and other groups. There are also attributes that can be associated with datasets and groups to describe some of their properties. The "hierarchical" in HDF5 refers to the fact that the data can be saved like a file system, with folder-like structures such as folders and subfolders (in HDF5, these are called groups and subgroups). Groups operate like dictionaries, where the keys are the names of the groups and the values are the subgroups or datasets.

To read and write HDF5 files in Python, there are several packages or wrappers that serve the purpose. The two most common packages are PyTables and h5py. We will only introduce h5py here. You can install h5py using conda (hopefully you still remember how to do that; if you forget, please go back to Chapter 1).
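For example, either of the following commands, run in a terminal, installs the package; use whichever matches your setup:

conda install h5py
pip install h5py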

After installing h5py, you can follow the quick start guide in the h5py documentation. But here, let's use one example to show how to create and read an HDF5 file. Let's import NumPy and h5py first.

import numpy as np
import h5py

Example: Suppose we deployed some instruments to monitor accelerations and GPS locations in the Bay Area, CA. We deployed two accelerometers, at Berkeley and Oakland, as well as one GPS station at San Francisco, and they record data at different sampling rates: the accelerometer at Berkeley samples the data every 0.04 s, the sensor at Oakland every 0.01 s, and the GPS at San Francisco samples the location every 60 s. Now we want to store the two types of data in an HDF5 file, along with some attributes indicating where the data was recorded, the start time of the recording, the station name, and the sampling interval.

# Generate random data for recording
acc_1 = np.random.random(1000)
station_number_1 = '1'
# unix timestamp
start_time_1 = 1542000276
# time interval for recording
dt_1 = 0.04
location_1 = 'Berkeley'

acc_2 = np.random.random(500)
station_number_2 = '2'
start_time_2 = 1542000576
dt_2 = 0.01
location_2 = 'Oakland'

# open a new HDF5 file for writing
hf = h5py.File('station.hdf5', 'w')

# store the Berkeley acceleration data and its attributes
hf['/acc/1/data'] = acc_1
hf['/acc/1/data'].attrs['dt'] = dt_1
hf['/acc/1/data'].attrs['start_time'] = start_time_1
hf['/acc/1/data'].attrs['location'] = location_1

# store the Oakland acceleration data and its attributes
hf['/acc/2/data'] = acc_2
hf['/acc/2/data'].attrs['dt'] = dt_2
hf['/acc/2/data'].attrs['start_time'] = start_time_2
hf['/acc/2/data'].attrs['location'] = location_2

# store the GPS data (generated randomly here) and its attributes
hf['/gps/1/data'] = np.random.random(100)
hf['/gps/1/data'].attrs['dt'] = 60
hf['/gps/1/data'].attrs['start_time'] = 1542000000
hf['/gps/1/data'].attrs['location'] = 'San Francisco'
# close the file to flush everything to disk
hf.close()

The above code shows the core concepts in HDF5: groups, datasets, and attributes. We first create an HDF5 file object for writing - station.hdf5. Then we start to store the data in different groups. We can see that we have two top-level groups, i.e. acc and gps, both of which contain subgroups 1 or 2 indicating the station numbers. Each station contains the next-level subgroup, data, which is used to store the array data we created. We can then add attributes to the groups or the data; here we only added dt, start_time, and location as attributes to the datasets we stored. You can see that this is quite similar to a folder-like structure, with the data acc_1 saved at /acc/1/data. Lastly, we close the file object.

Now we can see that saving data in HDF5 is easy. We could also use the functions create_dataset and create_group, as shown in the quick start, but I prefer the above approach, which creates the intermediate groups implicitly, as if navigating a folder structure.
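For comparison, here is a minimal sketch of the explicit approach for the first station; the file name station2.hdf5 is just a placeholder so we do not overwrite the file created above. Using the file object as a context manager also closes the file automatically:

# explicit API sketch; station2.hdf5 is a placeholder name
with h5py.File('station2.hdf5', 'w') as hf2:
    # create_group creates the intermediate groups along the path
    grp = hf2.create_group('acc/1')
    # create_dataset stores the array under the group
    dset = grp.create_dataset('data', data=acc_1)
    dset.attrs['dt'] = dt_1
    dset.attrs['start_time'] = start_time_1
    dset.attrs['location'] = location_1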

Read in the HDF5 Files

Now suppose you send station.hdf5 to a colleague who wants to access the data. Here is how he/she will do it.

hf_in = h5py.File('station.hdf5', 'r')
list(hf_in.keys())
['acc', 'gps']
acc = hf_in['acc']
list(acc.keys())
['1', '2']
data_1 = hf_in['acc/1/data']
data_1[:10]
array([0.41820889, 0.89832446, 0.40229251, 0.41287538, 0.16173359,
       0.75855904, 0.89288185, 0.82944522, 0.84228139, 0.50365515])
list(data_1.attrs)
['dt', 'start_time', 'location']
data_1.attrs['dt']
0.04
data_1.attrs['location']
'Berkeley'

We can see that reading an HDF5 file is also easy with h5py. After we read the file into hf_in, we can see which groups it contains using the keys function. Then we can access the group members and see what the subgroups contain, as with hf_in['acc'], or directly specify the path to a dataset, as with hf_in['acc/1/data'], and slice it to get the array data. Of course, the attributes associated with the data can also be accessed like a dictionary.
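If you do not know the structure of a file in advance, a handy trick is to walk the whole hierarchy. The sketch below uses h5py's visititems, which calls a function on every group and dataset in the file; the printout format here is our own choice:

# walk every group/dataset in the file and show its path and attributes
def print_item(name, obj):
    print(name, dict(obj.attrs))

hf_in.visititems(print_item)

Remember to close the file with hf_in.close() when you are done.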
