Historical CSV → CDF Processing CLI#
The process subcommand of python -m swxsoc_reach converts the
per-day UDL CSV files produced by Historical UDL Download CLI into
CDFs, with optional S3 upload. It is sequential at the day level and
resumable: rerunning the same date range only re-attempts days that
are not already complete according to the shared telemetry CSV.
This is Phase 2 of the historical reprocessing toolchain. Phase 1 (download) is documented at Historical UDL Download CLI.
Quick start#
# Phase 1: download the day's CSV
python -m swxsoc_reach download \
--start-date 2026-01-01 --end-date 2026-01-01 \
--sensor-id REACH-1 --output-dir ./out
# Phase 2 (local-only): produce a CDF
python -m swxsoc_reach process \
--start-date 2026-01-01 --end-date 2026-01-01 \
--sensor-id REACH-1 \
--input-dir ./out --output-dir ./out_cdf
# Phase 2 (with S3 upload): also upload to S3
python -m swxsoc_reach process \
--start-date 2026-01-01 --end-date 2026-01-01 \
--sensor-id REACH-1 \
--input-dir ./out --output-dir ./out_cdf \
--upload-to-s3 --s3-bucket dev-swxsoc-pipeline-incoming
CLI reference#
Required arguments#
--start-date YYYY-MM-DD— inclusive UTC start date.--end-date YYYY-MM-DD— inclusive UTC end date.--input-dir PATH— directory holding Phase 1 download artifacts. The orchestrator globs this directory for filenames matching each UTC day; days with no matching file are recorded asSKIPPED_NO_INPUT.--output-dir PATH— directory where CDF files are written.
Optional arguments#
--telemetry-file PATH— append-only telemetry CSV. Defaults to<input-dir>/download_telemetry.csv(the same file Phase 1 writes). Use this only if you keep telemetry separate from the download artifacts.--sensor-id— REACH sensor identifier orALL(default). Drives the input filename pattern.--descriptor— UDL descriptor (defaultQUICKLOOK).--output-format— input serialization format (defaultcsv).--retry-failed— re-attempt days whose latest telemetry status isFAILED.--limit-days N— cap attempted days, counted from the first day in the range that is not already complete.--dry-run— plan only: log per-day actions, write no telemetry, do no work.-v/-vv— increase logging verbosity.
S3 upload arguments#
--upload-to-s3— when set, uploads each successful CDF to S3 viasdc_aws_utils.aws.push_science_file().UPLOADEDis the terminal status. Without this flag,PROCESSEDis terminal.--s3-bucket— destination bucket (required iff--upload-to-s3is set).--aws-region— optional AWS region. Defaults toboto3’s standard region resolution chain.
Optional dependencies#
S3 upload requires the optional net extra:
pip install 'swxsoc_reach[net]'
This installs boto3 and sdc_aws_utils. Without these, calling
--upload-to-s3 raises RuntimeError with an install hint.
Status lifecycle#
Phase 2 introduces stage-prefixed pending statuses so the row indicates which stage was interrupted:
PROCESS_PENDING— process_file invocation in progress.PROCESSED— CDF written to disk. Terminal in local-only mode.UPLOAD_PENDING— S3 upload in progress.UPLOADED— CDF successfully uploaded. Terminal in upload mode.SKIPPED_NO_INPUT— no matching CSV for the day in--input-dir. Re-runnable: the day is re-attempted if the input later appears.
Phase 1 statuses (DOWNLOAD_PENDING, DOWNLOADED,
SKIPPED_NO_DATA) are treated as “no prior process-stage row” by
the resume logic.
Telemetry schema#
Phase 2 extends the Phase 1 download telemetry CSV with these
columns (all default to "" for older rows):
process_seconds— wall-clock seconds spent inprocess_file().cdf_size_mb— size of the produced CDF in megabytes.cdf_path— local path of the CDF.upload_seconds— wall-clock seconds spent in the S3 upload.s3_bucket— destination bucket reported on a successful upload.s3_key— S3 key returned bysdc_aws_utils.aws.push_science_file().
Restart semantics#
Resume rules per UTC day, based on the most-recent telemetry row:
UPLOADED→ skip (terminal).PROCESSED+ CDF on disk + upload mode → upload only.PROCESSED+ CDF on disk + local-only mode → skip (terminal).PROCESSED+ CDF missing → re-process from CSV.UPLOAD_PENDING+ CDF on disk → upload only.UPLOAD_PENDING+ CDF missing → re-process from CSV.PROCESS_PENDING→ re-process from CSV.FAILED→ skip unless--retry-failed.SKIPPED_NO_INPUT→ re-attempt if CSV is now present.Phase 1 statuses → re-process from CSV.
Operator runbook#
- Process succeeded but upload failed
The day’s row sequence ends in
UPLOAD_PENDINGthenFAILEDwith a populatederror_type/error_message. The CDF is on disk in--output-dir. Resolve the cause (creds, bucket access, etc.), then rerun with--retry-failed: the orchestrator detects the existing CDF and uploads only.- Process failed mid-day
The day’s row sequence ends in
PROCESS_PENDINGthenFAILED. Inspect the error fields, fix the underlying issue, then rerun with--retry-failed.- CDF accidentally deleted
Delete the CDF from
--output-dirand rerun without--retry-failed. The orchestrator detects the missing CDF and re-processes from the CSV automatically.
Notes & caveats#
process_filewrites toPath.cwd()whenLAMBDA_ENVIRONMENTis unset, which is always the case for a historical local run. The orchestrator chdir’s into--output-dirfor the duration of each call and restores the prior cwd infinally.When the same telemetry CSV is shared with Phase 1 (recommended), rerunning the
downloadsubcommand afterprocesscontinues to behave correctly becausedownloadonly consults Phase 1 statuses for resume.