Historical CSV → CDF Processing CLI#

The process subcommand of python -m swxsoc_reach converts the per-day UDL CSV files produced by Historical UDL Download CLI into CDFs, with optional S3 upload. It is sequential at the day level and resumable: rerunning the same date range only re-attempts days that are not already complete according to the shared telemetry CSV.

This is Phase 2 of the historical reprocessing toolchain. Phase 1 (download) is documented at Historical UDL Download CLI.

Quick start#

# Phase 1: download the day's CSV
python -m swxsoc_reach download \
    --start-date 2026-01-01 --end-date 2026-01-01 \
    --sensor-id REACH-1 --output-dir ./out

# Phase 2 (local-only): produce a CDF
python -m swxsoc_reach process \
    --start-date 2026-01-01 --end-date 2026-01-01 \
    --sensor-id REACH-1 \
    --input-dir ./out --output-dir ./out_cdf

# Phase 2 (with S3 upload): also upload to S3
python -m swxsoc_reach process \
    --start-date 2026-01-01 --end-date 2026-01-01 \
    --sensor-id REACH-1 \
    --input-dir ./out --output-dir ./out_cdf \
    --upload-to-s3 --s3-bucket dev-swxsoc-pipeline-incoming

CLI reference#

Required arguments#

--start-date YYYY-MM-DD — inclusive UTC start date.
--end-date YYYY-MM-DD — inclusive UTC end date.
--input-dir PATH — directory holding Phase 1 download artifacts. The orchestrator globs this directory for filenames matching each UTC day; days with no matching file are recorded as SKIPPED_NO_INPUT.
--output-dir PATH — directory where CDF files are written.

Optional arguments#

--telemetry-file PATH — append-only telemetry CSV. Defaults to <input-dir>/download_telemetry.csv (the same file Phase 1 writes). Use this only if you keep telemetry separate from the download artifacts.
--sensor-id — REACH sensor identifier or ALL (default). Drives the input filename pattern.
--descriptor — UDL descriptor (default QUICKLOOK).
--output-format — input serialization format (default csv).
--retry-failed — re-attempt days whose latest telemetry status is FAILED.
--limit-days N — cap attempted days, counted from the first day in the range that is not already complete.
--dry-run — plan only: log per-day actions, write no telemetry, do no work.
-v / -vv — increase logging verbosity.

S3 upload arguments#

--upload-to-s3 — when set, uploads each successful CDF to S3 via sdc_aws_utils.aws.push_science_file(). UPLOADED is the terminal status. Without this flag, PROCESSED is terminal.
--s3-bucket — destination bucket (required iff --upload-to-s3 is set).
--aws-region — optional AWS region. Defaults to boto3’s standard region resolution chain.

Optional dependencies#

S3 upload requires the optional net extra:

pip install 'swxsoc_reach[net]'

This installs boto3 and sdc_aws_utils. Without these, calling --upload-to-s3 raises RuntimeError with an install hint.

Status lifecycle#

Phase 2 introduces stage-prefixed pending statuses so the row indicates which stage was interrupted:

PROCESS_PENDING — process_file invocation in progress.
PROCESSED — CDF written to disk. Terminal in local-only mode.
UPLOAD_PENDING — S3 upload in progress.
UPLOADED — CDF successfully uploaded. Terminal in upload mode.
SKIPPED_NO_INPUT — no matching CSV for the day in --input-dir. Re-runnable: the day is re-attempted if the input later appears.

Phase 1 statuses (DOWNLOAD_PENDING, DOWNLOADED, SKIPPED_NO_DATA) are treated as “no prior process-stage row” by the resume logic.

Telemetry schema#

Phase 2 extends the Phase 1 download telemetry CSV with these columns (all default to "" for older rows):

process_seconds — wall-clock seconds spent in process_file().
cdf_size_mb — size of the produced CDF in megabytes.
cdf_path — local path of the CDF.
upload_seconds — wall-clock seconds spent in the S3 upload.
s3_bucket — destination bucket reported on a successful upload.
s3_key — S3 key returned by sdc_aws_utils.aws.push_science_file().

Restart semantics#

Resume rules per UTC day, based on the most-recent telemetry row:

UPLOADED → skip (terminal).
PROCESSED + CDF on disk + upload mode → upload only.
PROCESSED + CDF on disk + local-only mode → skip (terminal).
PROCESSED + CDF missing → re-process from CSV.
UPLOAD_PENDING + CDF on disk → upload only.
UPLOAD_PENDING + CDF missing → re-process from CSV.
PROCESS_PENDING → re-process from CSV.
FAILED → skip unless --retry-failed.
SKIPPED_NO_INPUT → re-attempt if CSV is now present.
Phase 1 statuses → re-process from CSV.

Operator runbook#

Process succeeded but upload failed: The day’s row sequence ends in UPLOAD_PENDING then FAILED with a populated error_type / error_message. The CDF is on disk in --output-dir. Resolve the cause (creds, bucket access, etc.), then rerun with --retry-failed: the orchestrator detects the existing CDF and uploads only.
Process failed mid-day: The day’s row sequence ends in PROCESS_PENDING then FAILED. Inspect the error fields, fix the underlying issue, then rerun with --retry-failed.
CDF accidentally deleted: Delete the CDF from --output-dir and rerun without --retry-failed. The orchestrator detects the missing CDF and re-processes from the CSV automatically.

Notes & caveats#

process_file writes to Path.cwd() when LAMBDA_ENVIRONMENT is unset, which is always the case for a historical local run. The orchestrator chdir’s into --output-dir for the duration of each call and restores the prior cwd in finally.
When the same telemetry CSV is shared with Phase 1 (recommended), rerunning the download subcommand after process continues to behave correctly because download only consults Phase 1 statuses for resume.