# other.codes — Analysis Methods

This document describes how the analysis pipeline converts raw field photographs
into the published metrics, vector traces, and visualisations.

## Pipeline overview

1. **Photograph** — Marks are photographed in the field using an iPhone.
   Images are captured as HEIC and converted to JPEG for processing.

2. **Segmentation** — Each photograph is opened in a browser-based annotation
   tool (Flask, port 5050). The mark is isolated from the background using either:
   - SAM2 (Segment Anything Model 2, Hiera Large) with point or box prompts, or
   - a Random Forest pixel classifier trained on brush-annotated foreground /
     background strokes (multi-scale features, ilastik-style).
   The output is a binary mask (white = mark, black = background) and an RGBA
   isolated crop, saved as `{stem}_mask.png` and `{stem}_isolated.png`.

3. **Rasterisation** — The mask is cropped to its tight bounding box and saved
   to `data/rasters/` as a clean binary PNG. This removes irrelevant background
   pixels and ensures all shape metrics are computed on the mark alone.

4. **Vectorisation** — The cropped mask is inverted (black mark on white ground)
   and passed to vtracer (spline mode, binary colourmode, filter\_speckle=4)
   via its Python API. The result is an SVG path trace of the mark outline,
   saved to `data/vectors/`. These are the files distributed in vectors.zip.

5. **Feature extraction** — Around 30 morphological metrics are computed from
   the binary mask for each mark. See the Metrics section below for definitions.

6. **PCA and clustering** — The 17 scale-independent metrics listed below are
   standardised and passed to PCA (for the scatter plot) and Ward hierarchical
   clustering (for the dendrogram). See the Analysis settings section.
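
The tight-bounding-box crop in step 3 can be sketched with NumPy alone. This is a minimal illustration, not the pipeline's actual code: the function name `crop_to_bbox` and the toy mask are assumptions, and file I/O to `data/rasters/` is left out.

```python
import numpy as np


def crop_to_bbox(mask):
    """Crop a 2-D mask to the tight bounding box of its foreground.

    Hypothetical helper for illustration; any non-zero pixel counts as mark.
    """
    fg = np.asarray(mask) > 0
    rows = np.flatnonzero(fg.any(axis=1))   # indices of rows containing foreground
    cols = np.flatnonzero(fg.any(axis=0))   # indices of columns containing foreground
    if rows.size == 0:
        raise ValueError("mask contains no foreground pixels")
    return fg[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


# Toy example: a 3 x 6 block of foreground inside a 10 x 10 image.
demo = np.zeros((10, 10), dtype=np.uint8)
demo[2:5, 3:9] = 255
cropped = crop_to_bbox(demo)
print(cropped.shape)  # (3, 6)
```

Cropping first means that bounding-box metrics such as fill\_density depend only on the mark, not on how much background the photograph happened to include.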

## Metrics

All metrics are computed from the binary mask unless otherwise noted.
Length-valued metrics are normalised by dividing by √area, which makes the
resulting ratios independent of image resolution and physical mark size.

### Shape

| Metric | Description |
|---|---|
| area\_px | Total foreground pixel count |
| bbox\_w / bbox\_h | Bounding box width and height (pixels) |
| aspect\_ratio | bbox\_w / bbox\_h |
| fill\_density | area\_px / (bbox\_w × bbox\_h) |
| compactness | 4π × area / perimeter² (1.0 = perfect circle) |
| solidity | area / convex hull area (largest connected component) |
| euler\_number | Topological proxy for enclosed holes (e.g. O, A, 4 have holes) |
| eccentricity | Eccentricity of best-fit ellipse (largest connected component) |
| n\_components | Number of disconnected foreground regions |
| perimeter | Total contour length (pixels) |
| perimeter\_norm | perimeter / √area |
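
As a worked illustration of why the ratio metrics are scale-independent, the sketch below evaluates a few of the formulas in the table on an ideal rectangle and on the same rectangle at twice the resolution. It assumes the perimeter is supplied by the caller, because the table does not fix how contour length is measured; the function name `shape_ratios` is hypothetical.

```python
import math

import numpy as np


def shape_ratios(mask, perimeter):
    """Evaluate some table formulas for a binary mask and a given perimeter."""
    fg = np.asarray(mask) > 0
    area = int(fg.sum())
    rows = np.flatnonzero(fg.any(axis=1))
    cols = np.flatnonzero(fg.any(axis=0))
    bbox_h = int(rows[-1] - rows[0] + 1)
    bbox_w = int(cols[-1] - cols[0] + 1)
    return {
        "aspect_ratio": bbox_w / bbox_h,
        "fill_density": area / (bbox_w * bbox_h),
        "compactness": 4 * math.pi * area / perimeter ** 2,
        "perimeter_norm": perimeter / math.sqrt(area),
    }


# A 20 x 10 rectangle with its ideal perimeter (60 px), then the same shape
# at double resolution (40 x 20, perimeter 120 px). Perimeters are assumed.
small = shape_ratios(np.ones((10, 20), dtype=bool), perimeter=60.0)
large = shape_ratios(np.ones((20, 40), dtype=bool), perimeter=120.0)
print(small)
print(large)
```

Doubling the resolution multiplies the area by four and the perimeter by two, so both dictionaries hold the same values up to floating-point rounding.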

### Stroke width

Stroke width is estimated per skeleton pixel as twice the Euclidean distance
transform value at that point — i.e. the diameter of the largest circle that
fits inside the stroke at that location.

| Metric | Description |
|---|---|
| mean\_stroke\_width | Mean stroke diameter across all skeleton pixels |
| stroke\_width\_std | Standard deviation of stroke diameter |
| stroke\_width\_cv | Coefficient of variation: std / mean |
| stroke\_width\_norm | mean\_stroke\_width / √area |
| max\_stroke\_width | Maximum stroke diameter |
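
The stroke-width estimate can be sketched with `scipy.ndimage.distance_transform_edt`. This is an illustration under stated assumptions: the skeleton is passed in as a second mask (in the pipeline it would come from the skeletonisation step), the standard deviation is the population form, and the function name `stroke_width_stats` is hypothetical.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def stroke_width_stats(mask, skeleton):
    """Stroke-width metrics from a binary mask and its skeleton mask."""
    fg = np.asarray(mask) > 0
    skel = np.asarray(skeleton) > 0
    dist = distance_transform_edt(fg)      # distance to nearest background pixel
    widths = 2.0 * dist[skel]              # diameter at each skeleton pixel
    mean = float(widths.mean())
    std = float(widths.std())              # population std (ddof=0), an assumption
    return {
        "mean_stroke_width": mean,
        "stroke_width_std": std,
        "stroke_width_cv": std / mean,
        "stroke_width_norm": float(mean / np.sqrt(fg.sum())),
        "max_stroke_width": float(widths.max()),
    }


# Toy example: a horizontal bar 5 px thick with a hand-drawn centre line.
bar = np.zeros((15, 50), dtype=bool)
bar[5:10, 5:45] = True
centre = np.zeros_like(bar)
centre[7, 8:42] = True
stats = stroke_width_stats(bar, centre)
print(stats["mean_stroke_width"])  # 6.0
```

The bar is 5 px thick but the estimate is 6 px, because the distance transform measures to the centre of the nearest background pixel. The offset is constant, so it matters little for comparisons between marks.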

### Skeleton topology

The skeleton (a one-pixel-wide thinning that approximates the medial axis) is
computed with `skimage.morphology.skeletonize`. Branch points and endpoints are
identified by counting the 8-connected neighbours of each skeleton pixel.

| Metric | Description |
|---|---|
| skeleton\_length | Total skeleton pixel count |
| skeleton\_density | skeleton\_length / √area |
| skeleton\_to\_area | skeleton\_length / area\_px |
| skeleton\_to\_perimeter | skeleton\_length / perimeter |
| n\_branch\_points | Skeleton pixels with ≥ 3 neighbours |
| n\_endpoints | Skeleton pixels with exactly 1 neighbour |
| branching\_density | n\_branch\_points / skeleton\_length |
| n\_loops\_est | max(0, branch\_pts − endpoints + 1) — Euler-characteristic loop estimate |
| endpoint\_branch\_ratio | n\_endpoints / n\_branch\_points |
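
The neighbour-counting rule can be sketched as a 3 × 3 convolution over the skeleton. This is a minimal sketch, not the pipeline's code: the function name `skeleton_topology` is hypothetical, and returning NaN for endpoint\_branch\_ratio when there are no branch points is an assumption, since the table does not say how that case is handled.

```python
import numpy as np
from scipy.ndimage import convolve


def skeleton_topology(skeleton):
    """Count endpoints and branch points by 8-connected neighbour count."""
    skel = (np.asarray(skeleton) > 0).astype(np.uint8)
    kernel = np.ones((3, 3), dtype=np.uint8)
    kernel[1, 1] = 0                                   # exclude the pixel itself
    neighbours = convolve(skel, kernel, mode="constant", cval=0)
    on = skel.astype(bool)
    n_end = int(np.count_nonzero(on & (neighbours == 1)))
    n_branch = int(np.count_nonzero(on & (neighbours >= 3)))
    length = int(on.sum())
    return {
        "skeleton_length": length,
        "n_endpoints": n_end,
        "n_branch_points": n_branch,
        "branching_density": n_branch / length,
        "n_loops_est": max(0, n_branch - n_end + 1),
        # Assumption: NaN when there are no branch points.
        "endpoint_branch_ratio": n_end / n_branch if n_branch else float("nan"),
    }


# Toy skeletons: a straight 7-pixel line and a plus sign with 2-pixel arms.
line = np.zeros((5, 9), dtype=bool)
line[2, 1:8] = True
plus = np.zeros((5, 5), dtype=bool)
plus[2, :] = True
plus[:, 2] = True
line_topo = skeleton_topology(line)
plus_topo = skeleton_topology(plus)
print(line_topo["n_endpoints"], line_topo["n_branch_points"])  # 2 0
print(plus_topo["n_endpoints"])  # 4
```

With this plain neighbour count, pixels diagonally adjacent to a junction can also reach three or more neighbours, so one visual junction may contribute several branch points. The plus sign above shows the effect.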

### Vector complexity

Measured from the vtracer SVG output, not from the raster mask.

| Metric | Description |
|---|---|
| n\_svg\_paths | Number of SVG `<path>` elements |
| svg\_path\_complexity | Total path command count across all paths |
| svg\_closed\_paths | Number of paths ending with a Z (close) command |
| svg\_closed\_ratio | svg\_closed\_paths / n\_svg\_paths |
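
One way to derive these counts from an SVG file uses only the Python standard library. The sketch assumes that "path command count" means the number of command letters in each `d` attribute (implicitly repeated commands are not counted), and the function name `svg_complexity` and the sample SVG are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# SVG path command letters, absolute and relative forms.
COMMAND_RE = re.compile(r"[MmZzLlHhVvCcSsQqTtAa]")


def svg_complexity(svg_text):
    """Count paths, path commands and closed paths in an SVG document."""
    root = ET.fromstring(svg_text)
    # Strip any XML namespace before comparing the tag name.
    paths = [el.get("d", "") for el in root.iter()
             if el.tag.split("}")[-1] == "path"]
    n_paths = len(paths)
    n_closed = sum(1 for d in paths if d.strip().lower().endswith("z"))
    return {
        "n_svg_paths": n_paths,
        "svg_path_complexity": sum(len(COMMAND_RE.findall(d)) for d in paths),
        "svg_closed_paths": n_closed,
        # Assumption: NaN when the file contains no paths.
        "svg_closed_ratio": n_closed / n_paths if n_paths else float("nan"),
    }


sample_svg = """<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
  <path d="M0 0 L10 0 L10 10 Z"/>
  <path d="M0 0 C1 1 2 2 3 3"/>
</svg>"""
vector_stats = svg_complexity(sample_svg)
print(vector_stats)
```

For the sample above, the first path contributes four commands and is closed, and the second contributes two and is open, giving 6 commands and a closed ratio of 0.5.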

## Metrics used for clustering and PCA

The following 17 metrics are passed to PCA and to the hierarchical clustering
that produces the dendrogram. They are selected for being scale-independent
and not directly redundant with each other. All remaining metrics are retained
in the public metrics.csv for reference but do not influence the analysis.

```
aspect_ratio         fill_density         compactness
solidity             euler_number         eccentricity
n_components         stroke_width_cv      stroke_width_norm
skeleton_density     skeleton_to_area     skeleton_to_perimeter
branching_density    n_loops_est          endpoint_branch_ratio
svg_closed_ratio     perimeter_norm
```

## Analysis settings

### Preprocessing

All 17 clustering metrics are standardised to zero mean and unit variance using
`sklearn.preprocessing.StandardScaler` before any analysis. This ensures that
features measured in different units (pixels, ratios, counts) contribute equally.

### PCA

| Setting | Value |
|---|---|
| Library | `sklearn.decomposition.PCA` |
| n\_components | min(n\_marks, n\_features, 10) |
| Input | StandardScaler-normalised feature matrix |
| Plot | PC1 (horizontal axis) × PC2 (vertical axis) |

Variance explained by each PC is shown on the axis labels of pca\_vector.svg
and in pca\_interpretation.md.
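
The preprocessing and PCA settings can be sketched as follows. The feature matrix here is random placeholder data with 12 rows and 17 columns, standing in for the real clustering metrics; the variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 12 marks x 17 clustering metrics (random, not real values).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 17))

# Standardise every metric to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# n_components = min(n_marks, n_features, 10), as in the settings table.
k = min(X_std.shape[0], X_std.shape[1], 10)
pca = PCA(n_components=k)
scores = pca.fit_transform(X_std)

# Percentage of variance explained by the two plotted components.
pc1_pct, pc2_pct = 100 * pca.explained_variance_ratio_[:2]
print(scores.shape)  # (12, 10)
```

Columns 0 and 1 of `scores` would give the PC1 and PC2 coordinates for the scatter plot, and `pc1_pct` and `pc2_pct` the percentages for its axis labels.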

### Hierarchical clustering (dendrogram)

| Setting | Value |
|---|---|
| Library | `scipy.cluster.hierarchy.linkage` + `dendrogram` |
| Linkage method | Ward |
| Distance metric | Euclidean (on standardised features) |
| Family grouping | `scipy.cluster.hierarchy.fcluster`, 3 groups, maxclust criterion |

The three family groups shown as coloured strips in dendrogram.svg are
determined by cutting the Ward tree into exactly 3 clusters using `fcluster`.
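
A minimal sketch of the Ward tree and the three-way cut is shown below. The input is synthetic: three well-separated blobs of four points each stand in for the standardised feature matrix, and the variable names are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the standardised matrix: 12 marks x 5 features,
# drawn around three well-separated centres (4 marks per centre).
rng = np.random.default_rng(0)
centres = np.repeat([0.0, 10.0, 20.0], 4)[:, None]
X_std = centres + rng.normal(scale=0.5, size=(12, 5))

# Ward linkage on Euclidean distances.
Z = linkage(X_std, method="ward", metric="euclidean")

# Cut the tree into at most 3 flat clusters (labels are 1-based).
families = fcluster(Z, t=3, criterion="maxclust")

# Leaf order of the dendrogram, computed without drawing anything.
leaf_order = dendrogram(Z, no_plot=True)["leaves"]
print(families)
```

With `criterion="maxclust"`, `t` is an upper bound on the number of flat clusters, so a dataset with many identical rows could in principle yield fewer than three families.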

## Software

| Package | Role |
|---|---|
| Python 3.11 | Runtime |
| numpy | Numerical arrays |
| scipy | Distance transform, clustering, dendrogram |
| scikit-image | Skeletonize, regionprops, contour tracing |
| scikit-learn | StandardScaler, PCA |
| Pillow | Image I/O |
| vtracer 0.6.x | Raster → SVG vectorisation |
| Flask | Browser-based annotation tool |
| SAM2 (Meta) | Segment Anything Model 2 — point/box-prompted segmentation |
| matplotlib | Reference PCA scatter (raster thumbnails, internal use) |

## Limitations

- **Small dataset** — the current collection contains approximately 12 marks from
  a single session. Results are exploratory and should not be generalised.
- **Single collector** — all photographs were taken by one person in one city.
  Geographic and stylistic coverage is narrow at this stage.
- **Manual segmentation** — masks are produced interactively from hand-placed
  prompts or brush strokes and may contain errors, particularly where
  backgrounds are complex or marks are faint.
- **Scale invariance** — the metrics used for clustering and PCA are chosen to be
  independent of image resolution and mark size. Absolute scale information
  (the physical size of the mark in the real world) is not captured.
- **No temporal or geographic metadata** — the public dataset does not include
  location, date, or photographer information.
