swxsoc_reach.calibration.transform

Core transformation functions for REACH UDL data.

Provides functions to deduplicate records, extract sensor metadata, build sparse time-aligned arrays, and assemble an SWXData object ready for CDF output.

Functions

deduplicate_records(data)

Remove duplicate records, keeping the latest reprocessed entry.

extract_sensor_metadata(data)

Extract sorted sensor IDs, observatory names, and per-sensor flavors.

create_observation_array(data, sensor_ids, ...)

Create a sparse observation array for obValue.

create_sensor_array(sensor_grouped, ...)

Create a sparse per-sensor array for a single column.

build_swxdata(data, *[, version, global_attrs])

Assemble an SWXData object from a raw REACH DataFrame.

swxsoc_reach.calibration.transform.build_swxdata(data: DataFrame, *, version: str = '1.0.0', global_attrs: dict | None = None) → SWXData

Assemble an SWXData object from a raw REACH DataFrame.

This is the main entry point for the transformation layer. It runs the following pipeline in order:

  1. Deduplicate records via deduplicate_records().

  2. Extract sensor metadata (sensor IDs, observatory names, flavors) via extract_sensor_metadata().

  3. Build common time axis from the unique UTC observation timestamps, stripping any trailing Z before parsing to avoid a stack overflow in astropy’s recursive ISO-8601 parser for large arrays.

  4. Pre-compute per-sensor groupby on a sensor-deduplicated view of the data for efficient scalar-column extraction.

  5. Build variable dict of NDData arrays (dose-rate cube, geolocation/quality arrays, sensor-position arrays, and label/ID metadata variables).

  6. Seed global attributes from REACHDataSchema defaults, then overlay version and any caller-supplied global_attrs.

  7. Assemble and return an SWXData instance ready to be written to CDF.
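The time-axis step of the pipeline can be illustrated with pandas alone. This is a minimal sketch, not the package's implementation: the helper name common_time_axis_sketch is hypothetical, and it assumes the timestamps are ISO-8601 strings in one consistent format. The hand-off of the Z-stripped strings to astropy is left out.

```python
import pandas as pd


def common_time_axis_sketch(ob_time: pd.Series) -> pd.DatetimeIndex:
    """Return the sorted, unique, UTC-aware observation times (illustrative only)."""
    # Drop a single trailing "Z"; the values are treated as UTC either way.
    stripped = ob_time.astype(str).str.removesuffix("Z")
    # Deduplicate before parsing so each distinct timestamp is parsed once.
    unique_times = stripped.drop_duplicates().tolist()
    return pd.to_datetime(unique_times, utc=True).sort_values()


ob_time = pd.Series(
    [
        "2024-01-01T00:00:01Z",
        "2024-01-01T00:00:00Z",
        "2024-01-01T00:00:01Z",  # repeated timestamp from a second sensor
    ]
)
times_pd = common_time_axis_sketch(ob_time)
```

The resulting index has one entry per distinct timestamp, which is the property the sparse arrays rely on for their first dimension.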

The returned SWXData contains:

Variable                  Shape
Epoch                     (n_times,)
sensor_labels             (n_sensors,)
sensor_ids                (n_sensors,)
dosimeter_flavor_labels   (n_flavors_max,)
dosimeter_flavor_ids      (n_flavors_max,)
dosimeter_flavors         (n_sensors, n_flavors_max)
dose_rate                 (n_times, n_sensors, n_flavors_max)
lat                       (n_times, n_sensors)
lon                       (n_times, n_sensors)
alt                       (n_times, n_sensors)
obQuality                 (n_times, n_sensors)
sensor_position_x         (n_times, n_sensors)
sensor_position_y         (n_times, n_sensors)
sensor_position_z         (n_times, n_sensors)

Parameters:
  • data (pd.DataFrame) – Raw (flat) DataFrame as returned by read_udl_json() or read_udl_csv().

  • version (str, optional) – Data version string written into the global attributes (default "1.0.0").

  • global_attrs (dict or None, optional) – Additional global attributes to merge on top of the defaults provided by REACHDataSchema. Data_version is always set to version.

Returns:

SWXData – Fully assembled SWXData instance ready to be saved as CDF.
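One reading of how version and global_attrs combine is sketched below. The helper name merge_global_attrs_sketch and the contents of the defaults dictionary are hypothetical stand-ins (the real defaults come from REACHDataSchema and are not reproduced here). The only behavior taken from the description above is that caller-supplied attributes overlay the defaults and that Data_version always ends up equal to version.

```python
def merge_global_attrs_sketch(defaults, version, global_attrs=None):
    """Overlay caller attributes on defaults, then pin Data_version (illustrative only)."""
    attrs = dict(defaults)            # copy so the defaults are not mutated
    attrs.update(global_attrs or {})  # caller-supplied values win over defaults
    attrs["Data_version"] = version   # always set last, so it cannot be overridden
    return attrs


# Hypothetical defaults standing in for the schema-provided ones.
defaults = {"Descriptor": "placeholder", "Data_version": "0.0.0"}
merged = merge_global_attrs_sketch(
    defaults,
    version="1.2.0",
    global_attrs={"Descriptor": "custom", "Data_version": "9.9.9"},
)
```

Under this reading, a Data_version key passed inside global_attrs is ignored in favor of the version argument.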

swxsoc_reach.calibration.transform.create_observation_array(data: DataFrame, sensor_ids: list[str], times_pd: DatetimeIndex, observation_flavors: list[list[str]]) → ndarray

Create a sparse observation array for obValue.

For each sensor and each of its dosimeter flavors, extracts the observation values and aligns them to a common time index, filling missing entries with NaN.

Parameters:
  • data (pd.DataFrame) – Deduplicated DataFrame with columns idSensor, obDescription, obTime, and obValue.

  • sensor_ids (list[str]) – Sorted list of unique sensor IDs.

  • times_pd (pd.DatetimeIndex) – Sorted, UTC-localized DatetimeIndex of unique observation times.

  • observation_flavors (list[list[str]]) – For each sensor (matching sensor_ids order), a list of the sorted unique dosimeter flavor strings (obDescription), padded with "" so that every inner list has the same length (equal to the maximum number of flavors across all sensors).

Returns:

np.ndarray – 3-D float array of shape (n_times, n_sensors, n_flavors_max) with NaN for missing values, where n_flavors_max is the maximum number of dosimeter flavors across all sensors and the last dimension indexes those flavors for each sensor.
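The alignment described above can be sketched as follows. This is an illustration of the technique under stated assumptions, not the library's code: the function name observation_array_sketch is hypothetical, the input is assumed to be already deduplicated, and a simple boolean mask per (sensor, flavor) pair is used for clarity rather than speed.

```python
import numpy as np
import pandas as pd


def observation_array_sketch(data, sensor_ids, times_pd, observation_flavors):
    """Return a (n_times, n_sensors, n_flavors_max) float array, NaN where unobserved."""
    n_flavors_max = len(observation_flavors[0]) if observation_flavors else 0
    cube = np.full((len(times_pd), len(sensor_ids), n_flavors_max), np.nan)
    obs_dt = pd.to_datetime(data["obTime"], utc=True)
    for s, sensor_id in enumerate(sensor_ids):
        for f, flavor in enumerate(observation_flavors[s]):
            if flavor == "":
                continue  # padding slot: this sensor has fewer flavors
            mask = (data["idSensor"] == sensor_id) & (data["obDescription"] == flavor)
            # Map each observation time to its row on the common time axis.
            rows = times_pd.get_indexer(pd.DatetimeIndex(obs_dt[mask]))
            found = rows >= 0  # -1 marks a time missing from times_pd
            values = data.loc[mask, "obValue"].to_numpy(dtype=float)
            cube[rows[found], s, f] = values[found]
    return cube


data = pd.DataFrame(
    {
        "idSensor": ["S1", "S1", "S2", "S2"],
        "obDescription": ["A", "A", "A", "B"],
        "obTime": ["2024-01-01T00:00:00", "2024-01-01T00:00:01"] * 2,
        "obValue": [1.0, 2.0, 3.0, 4.0],
    }
)
times_pd = pd.DatetimeIndex(pd.to_datetime(data["obTime"], utc=True)).unique().sort_values()
cube = observation_array_sketch(data, ["S1", "S2"], times_pd, [["A", ""], ["A", "B"]])
```

Here sensor S1 has a single flavor, so its second flavor slot stays NaN at every time step.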

swxsoc_reach.calibration.transform.create_sensor_array(sensor_grouped: DataFrameGroupBy, sensor_deduped_dt: Series, sensor_ids: list[str], times_pd: DatetimeIndex, col: str) → ndarray

Create a sparse per-sensor array for a single column.

Extracts values of col for each sensor from pre-grouped and deduplicated data, aligns them to a common time index, and fills missing entries with NaN.

Parameters:
  • sensor_grouped (pd.core.groupby.DataFrameGroupBy) – Pre-computed groupby on idSensor from the sensor-deduplicated DataFrame.

  • sensor_deduped_dt (pd.Series) – Datetime-converted obTime column from the sensor-deduplicated DataFrame, sharing the same index so it can be used for alignment.

  • sensor_ids (list[str]) – Sorted list of unique sensor IDs.

  • times_pd (pd.DatetimeIndex) – Sorted, UTC-localized DatetimeIndex of unique observation times.

  • col (str) – Column name to extract (e.g. 'lat', 'lon', 'alt').

Returns:

np.ndarray – 2-D float array of shape (n_times, n_sensors) with NaN for missing values.
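A rough sketch of the per-sensor alignment, using the same argument layout as the signature above. The function name sensor_array_sketch is hypothetical, and the example frame is assumed to already hold one row per sensor and time (the "sensor-deduplicated view").

```python
import numpy as np
import pandas as pd


def sensor_array_sketch(sensor_grouped, sensor_deduped_dt, sensor_ids, times_pd, col):
    """Return a (n_times, n_sensors) float array for one column, NaN where missing."""
    out = np.full((len(times_pd), len(sensor_ids)), np.nan)
    for s, sensor_id in enumerate(sensor_ids):
        if sensor_id not in sensor_grouped.groups:
            continue  # sensor has no rows at all
        group = sensor_grouped.get_group(sensor_id)
        # The datetime Series shares the frame's index, so group.index selects its times.
        group_times = pd.DatetimeIndex(sensor_deduped_dt.loc[group.index])
        rows = times_pd.get_indexer(group_times)
        found = rows >= 0  # -1 marks a time missing from times_pd
        out[rows[found], s] = group[col].to_numpy(dtype=float)[found]
    return out


frame = pd.DataFrame(
    {
        "idSensor": ["S1", "S1", "S2"],
        "obTime": ["2024-01-01T00:00:00", "2024-01-01T00:00:01", "2024-01-01T00:00:01"],
        "lat": [10.0, 11.0, 20.0],
    }
)
sensor_dt = pd.to_datetime(frame["obTime"], utc=True)
times_pd = pd.DatetimeIndex(sensor_dt).unique().sort_values()
lat = sensor_array_sketch(frame.groupby("idSensor"), sensor_dt, ["S1", "S2"], times_pd, "lat")
```

Passing the groupby in pre-computed form lets the same grouping be reused for every scalar column (lat, lon, alt, and so on) instead of regrouping per call.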

swxsoc_reach.calibration.transform.deduplicate_records(data: DataFrame) → DataFrame

Remove duplicate records, keeping the latest reprocessed entry.

For each unique combination of (idSensor, obDescription, obTime), only the row with the most recent createdAt timestamp is retained. The returned DataFrame is sorted by obTime with a reset index.

Parameters:

data (pd.DataFrame) – Raw (flat) DataFrame from read_udl_json().

Returns:

pd.DataFrame – Deduplicated DataFrame sorted by observation time.
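The keep-latest rule can be expressed with a sort followed by drop_duplicates. The sketch below is an assumption about how such a step might look, not the function's actual source; the name deduplicate_sketch and the temporary _created column are hypothetical.

```python
import pandas as pd


def deduplicate_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Keep the newest createdAt row per (idSensor, obDescription, obTime)."""
    key = ["idSensor", "obDescription", "obTime"]
    # Compare createdAt as timestamps rather than as raw strings.
    created = pd.to_datetime(data["createdAt"], utc=True)
    ordered = data.assign(_created=created).sort_values("_created", kind="stable")
    # After sorting, the last row of each key group is the latest reprocessing.
    latest = ordered.drop_duplicates(subset=key, keep="last")
    return (
        latest.drop(columns="_created")
        .sort_values("obTime", kind="stable")
        .reset_index(drop=True)
    )


raw = pd.DataFrame(
    {
        "idSensor": ["A", "A", "B"],
        "obDescription": ["X", "X", "X"],
        "obTime": ["2024-01-01T00:00:10", "2024-01-01T00:00:10", "2024-01-01T00:00:00"],
        "createdAt": ["2024-01-02T00:00:00", "2024-01-03T00:00:00", "2024-01-02T00:00:00"],
        "obValue": [1.0, 2.0, 3.0],
    }
)
deduped = deduplicate_sketch(raw)
```

In this example the two rows for sensor A share a key, so only the one created on 2024-01-03 survives.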

swxsoc_reach.calibration.transform.extract_sensor_metadata(data: DataFrame) → tuple[list[str], list[str], list[list[str | None]]]

Extract sorted sensor IDs, observatory names, and per-sensor flavors.

Parameters:

data (pd.DataFrame) – Deduplicated DataFrame.

Returns:

  • sensor_ids (list[str]) – Sorted list of unique sensor IDs.

  • obs_names (list[str]) – Sorted list of unique observatory names.

  • observation_flavors (list[list[str | None]]) – For each sensor (matching sensor_ids order), a list of the sorted unique dosimeter flavor strings (obDescription), padded with "" so that every inner list has the same length (equal to the maximum number of flavors across all sensors).
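The padding of the per-sensor flavor lists can be sketched like this. The helper name padded_flavors_sketch is hypothetical, and the observatory names are omitted because the column that holds them is not named in this page.

```python
import pandas as pd


def padded_flavors_sketch(data: pd.DataFrame) -> tuple[list[str], list[list[str]]]:
    """Return sorted sensor IDs and equal-length, ""-padded flavor lists."""
    sensor_ids = sorted(set(data["idSensor"]))
    per_sensor = [
        sorted(set(data.loc[data["idSensor"] == sensor_id, "obDescription"]))
        for sensor_id in sensor_ids
    ]
    n_flavors_max = max((len(flavors) for flavors in per_sensor), default=0)
    # Pad shorter lists with "" so every sensor has n_flavors_max slots.
    padded = [flavors + [""] * (n_flavors_max - len(flavors)) for flavors in per_sensor]
    return sensor_ids, padded


data = pd.DataFrame(
    {
        "idSensor": ["S2", "S2", "S1"],
        "obDescription": ["B", "A", "A"],
    }
)
sensor_ids, observation_flavors = padded_flavors_sketch(data)
```

Equal-length inner lists are what allow the flavors to become the last dimension of a rectangular (n_times, n_sensors, n_flavors_max) array.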