swxsoc_reach.calibration.transform
Core transformation functions for REACH UDL data.
Provides functions to deduplicate records, extract sensor metadata, build sparse time-aligned arrays, and assemble an SWXData object ready for CDF output.
Functions
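deduplicate_records | Remove duplicate records, keeping the latest reprocessed entry.
extract_sensor_metadata | Extract sorted sensor IDs, observatory names, and per-sensor flavors.
create_observation_array | Create a sparse observation array for obValue.
create_sensor_array | Create a sparse per-sensor array for a single column.
build_swxdata | Assemble an SWXData object from a raw REACH DataFrame.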
- swxsoc_reach.calibration.transform.build_swxdata(data: DataFrame, *, version: str = '1.0.0', global_attrs: dict | None = None) → SWXData
Assemble an SWXData object from a raw REACH DataFrame.
This is the main entry point for the transformation layer. It runs the following pipeline in order:
1. Deduplicate records via deduplicate_records().
2. Extract sensor metadata (sensor IDs, observatory names, flavors) via extract_sensor_metadata().
3. Build the common time axis from the unique UTC observation timestamps, stripping any trailing Z before parsing to avoid a stack overflow in astropy’s recursive ISO-8601 parser for large arrays.
4. Pre-compute a per-sensor groupby on a sensor-deduplicated view of the data for efficient scalar-column extraction.
5. Build the variable dict of NDData arrays (dose-rate cube, geolocation/quality arrays, sensor-position arrays, and label/ID metadata variables).
6. Seed the global attributes from REACHDataSchema defaults, then overlay version and any caller-supplied global_attrs.
7. Assemble and return an SWXData instance ready to be written to CDF.
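Step 3 of the pipeline above can be sketched as follows. This is a minimal illustration and not the library's code: pandas stands in for the astropy parsing mentioned in the text, build_time_axis is a hypothetical helper name, and the timestamps are assumed to be ISO-8601 strings with an optional trailing Z.

```python
import pandas as pd


def build_time_axis(ob_times: pd.Series) -> pd.DatetimeIndex:
    """Hypothetical helper: sorted, UTC-localized index of unique observation times."""
    # Strip a trailing "Z" so every string has the same timezone-free form.
    stripped = ob_times.astype(str).str.replace(r"Z$", "", regex=True)
    # Parse the unique strings as naive datetimes, then attach UTC explicitly.
    parsed = pd.to_datetime(stripped.unique())
    return pd.DatetimeIndex(parsed).tz_localize("UTC").sort_values()


# Made-up timestamps: one duplicate, deliberately out of order.
raw = pd.Series(
    ["2024-01-01T00:00:10Z", "2024-01-01T00:00:00Z", "2024-01-01T00:00:10Z"]
)
times_pd = build_time_axis(raw)
```

Deduplicating before parsing keeps the parsed array as small as possible, which matters when many sensors report at the same timestamps.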
The returned SWXData contains:
Variable | Shape
Epoch | (n_times,)
sensor_labels | (n_sensors,)
sensor_ids | (n_sensors,)
dosimeter_flavor_labels | (n_flavors_max,)
dosimeter_flavor_ids | (n_flavors_max,)
dosimeter_flavors | (n_sensors, n_flavors_max)
dose_rate | (n_times, n_sensors, n_flavors_max)
lat | (n_times, n_sensors)
lon | (n_times, n_sensors)
alt | (n_times, n_sensors)
obQuality | (n_times, n_sensors)
sensor_position_x | (n_times, n_sensors)
sensor_position_y | (n_times, n_sensors)
sensor_position_z | (n_times, n_sensors)
- Parameters:
data (pd.DataFrame) – Raw (flat) DataFrame as returned by read_udl_json() or read_udl_csv().
version (str, optional) – Data version string written into the global attributes (default "1.0.0").
global_attrs (dict or None, optional) – Additional global attributes to merge on top of the defaults provided by REACHDataSchema. Data_version is always set to version.
- Returns:
SWXData – Fully assembled SWXData instance ready to be saved as CDF.
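The global-attribute handling described for build_swxdata() can be sketched in plain Python. This is one reading of the documented behavior, not the actual implementation: merge_global_attrs is a hypothetical helper, the default values shown are placeholders for whatever REACHDataSchema provides, and the ordering (defaults, then caller attributes, then Data_version) is an assumption chosen so that Data_version always ends up equal to version.

```python
def merge_global_attrs(defaults, version, global_attrs=None):
    """Hypothetical helper mirroring the documented overlay order."""
    attrs = dict(defaults)            # seed from schema defaults (copied, not mutated)
    attrs.update(global_attrs or {})  # caller-supplied attributes override defaults
    attrs["Data_version"] = version   # the version argument always wins
    return attrs


# Placeholder defaults; the real values come from REACHDataSchema.
defaults = {"Data_version": "0.0.0", "Descriptor": "placeholder"}
merged = merge_global_attrs(
    defaults, "1.2.0", {"Descriptor": "custom", "Data_version": "9.9.9"}
)
```

Under this reading, a Data_version key passed in global_attrs is silently replaced by the version argument.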
- swxsoc_reach.calibration.transform.create_observation_array(data: DataFrame, sensor_ids: list[str], times_pd: DatetimeIndex, observation_flavors: list[list[str]]) → ndarray
Create a sparse observation array for obValue.
For each sensor and each of its dosimeter flavors, extracts the observation values and aligns them to a common time index, filling missing entries with NaN.
- Parameters:
data (pd.DataFrame) – Deduplicated DataFrame with columns idSensor, obDescription, obTime, and obValue.
times_pd (pd.DatetimeIndex) – Sorted, UTC-localized DatetimeIndex of unique observation times.
observation_flavors (list[list[str]]) – For each sensor (matching sensor_ids order), a list of the sorted unique dosimeter flavor strings (obDescription), padded with "" so that every inner list has the same length (equal to the maximum number of flavors across all sensors).
- Returns:
np.ndarray – 3-D float array of shape (n_times, n_sensors, n_flavors_max) with NaN for missing values, where n_flavors_max is the maximum number of dosimeter flavors across all sensors and the last dimension indexes those flavors for each sensor.
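The alignment described for create_observation_array() can be sketched with NumPy and pandas. This is an illustrative reimplementation under assumptions, not the library's code: observation_array_sketch is a hypothetical name, the flavor strings "X" and "Y" are placeholders, and obTime is assumed to hold ISO-8601 strings.

```python
import numpy as np
import pandas as pd


def observation_array_sketch(data, sensor_ids, times_pd, observation_flavors):
    """Illustrative only: fill a (n_times, n_sensors, n_flavors_max) NaN cube."""
    n_flavors_max = max((len(f) for f in observation_flavors), default=0)
    cube = np.full((len(times_pd), len(sensor_ids), n_flavors_max), np.nan)
    # Parse as UTC so the timestamps compare equal to the entries of times_pd.
    ob_time = pd.to_datetime(data["obTime"], utc=True)
    for s, sensor in enumerate(sensor_ids):
        for f, flavor in enumerate(observation_flavors[s]):
            if not flavor:  # "" marks padding, not a real flavor
                continue
            mask = (data["idSensor"] == sensor) & (data["obDescription"] == flavor)
            rows = times_pd.get_indexer(pd.DatetimeIndex(ob_time[mask]))
            keep = rows >= 0  # -1 means the timestamp is not on the common axis
            cube[rows[keep], s, f] = data.loc[mask, "obValue"].to_numpy()[keep]
    return cube


# Made-up records: sensor "A" has two flavors, sensor "B" has one.
data = pd.DataFrame({
    "idSensor": ["A", "A", "B"],
    "obDescription": ["X", "Y", "X"],
    "obTime": ["2024-01-01T00:00:00Z", "2024-01-01T00:00:00Z", "2024-01-01T00:00:10Z"],
    "obValue": [1.0, 2.0, 3.0],
})
times_pd = pd.to_datetime(["2024-01-01T00:00:00", "2024-01-01T00:00:10"], utc=True)
cube = observation_array_sketch(
    data, ["A", "B"], times_pd, [["X", "Y"], ["X", ""]]
)
```

The padded slot for sensor "B" stays NaN at every time step, which is how a missing flavor is distinguished from a missing measurement only by consulting observation_flavors.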
- swxsoc_reach.calibration.transform.create_sensor_array(sensor_grouped: DataFrameGroupBy, sensor_deduped_dt: Series, sensor_ids: list[str], times_pd: DatetimeIndex, col: str) → ndarray
Create a sparse per-sensor array for a single column.
Extracts values of col for each sensor from pre-grouped and deduplicated data, aligns them to a common time index, and fills missing entries with NaN.
- Parameters:
sensor_grouped (pd.core.groupby.DataFrameGroupBy) – Pre-computed groupby on idSensor from the sensor-deduplicated DataFrame.
sensor_deduped_dt (pd.Series) – Datetime-converted obTime column from the sensor-deduplicated DataFrame, sharing the same index so it can be used for alignment.
times_pd (pd.DatetimeIndex) – Sorted, UTC-localized DatetimeIndex of unique observation times.
col (str) – Column name to extract (e.g. 'lat', 'lon', 'alt').
- Returns:
np.ndarray – 2-D float array of shape (n_times, n_sensors) with NaN for missing values.
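The per-sensor alignment in create_sensor_array() can be sketched in the same way. This is a hedged illustration, not the library's code: sensor_array_sketch is a hypothetical name, and "sensor-deduplicated" is assumed here to mean one row per (idSensor, obTime) pair.

```python
import numpy as np
import pandas as pd


def sensor_array_sketch(sensor_grouped, sensor_deduped_dt, sensor_ids, times_pd, col):
    """Illustrative only: fill a (n_times, n_sensors) NaN array for one column."""
    out = np.full((len(times_pd), len(sensor_ids)), np.nan)
    for s, sensor in enumerate(sensor_ids):
        if sensor not in sensor_grouped.groups:
            continue  # sensor has no rows, so its column stays NaN
        group = sensor_grouped.get_group(sensor)
        # The datetime Series shares the DataFrame's index, so .loc lines it up.
        group_times = pd.DatetimeIndex(sensor_deduped_dt.loc[group.index])
        rows = times_pd.get_indexer(group_times)
        keep = rows >= 0  # -1 means the timestamp is not on the common axis
        out[rows[keep], s] = group[col].to_numpy(dtype=float)[keep]
    return out


# Made-up data: sensor "B" has no record at the second time step.
df = pd.DataFrame({
    "idSensor": ["A", "B", "A"],
    "obTime": ["2024-01-01T00:00:00Z", "2024-01-01T00:00:00Z", "2024-01-01T00:00:10Z"],
    "lat": [10.0, 20.0, 11.0],
})
dt = pd.to_datetime(df["obTime"], utc=True)
times_pd = pd.DatetimeIndex(dt.unique()).sort_values()
lat = sensor_array_sketch(df.groupby("idSensor"), dt, ["A", "B"], times_pd, "lat")
```

Passing the groupby and the parsed datetime Series in from outside lets the caller compute them once and reuse them for every column (lat, lon, alt, and so on).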
- swxsoc_reach.calibration.transform.deduplicate_records(data: DataFrame) → DataFrame
Remove duplicate records, keeping the latest reprocessed entry.
For each unique combination of (idSensor, obDescription, obTime), only the row with the most recent createdAt timestamp is retained. The returned DataFrame is sorted by obTime with a reset index.
- Parameters:
data (pd.DataFrame) – Raw (flat) DataFrame from read_udl_json().
- Returns:
pd.DataFrame – Deduplicated DataFrame sorted by observation time.
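The rule described for deduplicate_records() maps onto a sort followed by drop_duplicates in pandas. The sketch below is one way to express it, not necessarily how the library does it: deduplicate_sketch and the temporary _created column are hypothetical, and obTime and createdAt are assumed to be ISO-8601 strings in a consistent format.

```python
import pandas as pd


def deduplicate_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: keep the newest createdAt per (idSensor, obDescription, obTime)."""
    keys = ["idSensor", "obDescription", "obTime"]
    # Sort on parsed timestamps so "latest" means chronologically latest.
    created = pd.to_datetime(data["createdAt"], utc=True)
    latest = (
        data.assign(_created=created)
        .sort_values("_created", kind="stable")
        .drop_duplicates(subset=keys, keep="last")
        .drop(columns="_created")
    )
    return latest.sort_values("obTime", kind="stable").reset_index(drop=True)


# Made-up records: the second row is a reprocessed version of the first.
raw = pd.DataFrame({
    "idSensor": ["A", "A", "A"],
    "obDescription": ["X", "X", "X"],
    "obTime": ["2024-01-01T00:00:10Z", "2024-01-01T00:00:10Z", "2024-01-01T00:00:00Z"],
    "obValue": [1.0, 1.5, 0.5],
    "createdAt": ["2024-01-02T00:00:00Z", "2024-01-03T00:00:00Z", "2024-01-02T00:00:00Z"],
})
clean = deduplicate_sketch(raw)
```

In this example the reprocessed value 1.5 replaces 1.0, and the result is ordered by observation time.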
- swxsoc_reach.calibration.transform.extract_sensor_metadata(data: DataFrame) → tuple[list[str], list[str], list[list[str | None]]]
Extract sorted sensor IDs, observatory names, and per-sensor flavors.
- Parameters:
data (pd.DataFrame) – Deduplicated DataFrame.
- Returns:
sensor_ids (list[str]) – Sorted list of unique sensor IDs.
obs_names (list[str]) – Sorted list of unique observatory names.
observation_flavors (list[list[str | None]]) – For each sensor (matching sensor_ids order), a list of the sorted unique dosimeter flavor strings (obDescription), padded with "" so that every inner list has the same length (equal to the maximum number of flavors across all sensors).
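The padded flavor lists returned by extract_sensor_metadata() can be sketched as follows. This covers only the sensor IDs and flavors; the observatory names are omitted because the source column for them is not documented here. padded_flavors_sketch is a hypothetical name and the flavor strings are placeholders.

```python
import pandas as pd


def padded_flavors_sketch(data: pd.DataFrame) -> tuple[list[str], list[list[str]]]:
    """Illustrative only: sorted sensor IDs plus ""-padded flavor lists."""
    sensor_ids = sorted(data["idSensor"].unique())
    by_sensor = data.groupby("idSensor")["obDescription"]
    flavors = [sorted(by_sensor.get_group(s).unique()) for s in sensor_ids]
    # Pad every list with "" up to the longest one so the result is rectangular.
    width = max((len(f) for f in flavors), default=0)
    return sensor_ids, [f + [""] * (width - len(f)) for f in flavors]


# Made-up records: sensor "A" reports two flavors, sensor "B" only one.
records = pd.DataFrame({
    "idSensor": ["B", "A", "A"],
    "obDescription": ["X", "Y", "X"],
})
sensor_ids, observation_flavors = padded_flavors_sketch(records)
```

The rectangular shape is what allows the flavor lists to index the last dimension of the dose-rate cube directly.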