palimpsestr: Probabilistic Decomposition of Archaeological Palimpsests Using Stratigraphic Entanglement Fields

Archaeological palimpsests — deposits in which material from multiple occupation phases is superimposed and partially intermixed — represent one of the most persistent analytical challenges in field archaeology. The palimpsestr R package implements the Stratigraphic Entanglement Field (SEF) framework, a probabilistic method for decomposing palimpsests into latent depositional phases by jointly modelling four evidence domains: spatial proximity, vertical distribution, chronological overlap, and cultural similarity. The package provides a mixed-type finite mixture model — a diagonal-covariance Gaussian over the numeric features combined with a per-phase categorical distribution over the cultural class — fitted via Expectation-Maximisation, augmented with taphonomic weighting, a uniform noise/outlier component, and stratigraphic entanglement constraints. Three interpretable diagnostics — the Stratigraphic Entanglement Index (SEI), Excavation Stratigraphic Energy (ESE), and Palimpsest Dissolution Index (PDI) — allow practitioners to assess deposit coherence, detect intrusive finds, and evaluate overall phase separability. This article describes the methodology and demonstrates a complete analytical workflow on a realistic multi-period Roman villa dataset.

Introduction

Archaeological palimpsests arise when successive episodes of human occupation leave material traces within the same sedimentary matrix, producing deposits in which artefacts, ecofacts, and features from chronologically distinct phases co-occur in close spatial and vertical proximity. The phenomenon is ubiquitous: from deeply stratified Near Eastern tells to shallow open-air Palaeolithic sites where millennia of activity are compressed into a few centimetres of deposit. Disentangling the depositional history of such contexts is essential for any behavioural, technological, or settlement-pattern inference, yet the analytical toolkit available to excavators has remained surprisingly limited.

The Harris Matrix (Harris, 1989) provides a rigorous formalism for recording observed stratigraphic relationships, but its relationships are deterministic and binary — a unit either overlies another or it does not. It cannot express the gradual, probabilistic mixing that characterises many palimpsests. Conversely, spatial clustering methods such as k-means (Kintigh and Ammerman, 1982) operate on coordinate data alone, ignoring vertical stratigraphy, chronological evidence from radiocarbon or ceramic seriation, and cultural-typological information. Neither approach offers a unified probabilistic framework, and neither produces quantitative diagnostics for deposit integrity or mixing intensity.

palimpsestr addresses these limitations by integrating four evidence domains — horizontal coordinates, vertical elevation, chronological range, and cultural class — into a single model that produces soft (probabilistic) phase assignments together with three diagnostic statistics. The Stratigraphic Entanglement Index (SEI) quantifies pairwise depositional coherence; Excavation Stratigraphic Energy (ESE) captures local disruption; and the Palimpsest Dissolution Index (PDI) summarises global phase separability. All three are interpretable by field practitioners and can be exported to GIS for spatial interrogation.

Methodology

The Stratigraphic Entanglement Index (SEI)

The SEI quantifies how strongly two finds \(i\) and \(j\) are “entangled” — that is, how much evidence supports their co-occurrence within the same depositional event. It is defined as:

\[SEI_{ij} = \frac{w_s}{d_{xy}(i,j) + \epsilon} + \frac{w_z}{\max(|z_i - z_j|, z_{floor})} + w_t \cdot O_t(i,j) + w_c \cdot \mathbb{1}[c_i = c_j]\]

where:

  • \(d_{xy}(i,j)\) is the Euclidean distance in the horizontal plane, and \(\epsilon\) is a small constant preventing division by zero. The first term captures spatial proximity: finds close together in plan are more likely to derive from the same depositional event.
  • \(|z_i - z_j|\) is the absolute vertical separation, with \(z_{floor}\) a minimum threshold. The second term captures stratigraphic proximity: finds at the same depth are more likely contemporaneous.
  • \(O_t(i,j)\) is the chronological overlap ratio, computed from the intersection of the temporal ranges \([t_{min,i}, t_{max,i}]\) and \([t_{min,j}, t_{max,j}]\) relative to their union. Finds whose date ranges overlap strongly receive higher entanglement scores.
  • \(\mathbb{1}[c_i = c_j]\) is an indicator function equal to 1 when both finds belong to the same cultural or typological class, and 0 otherwise.
  • \(w_s, w_z, w_t, w_c\) are domain weights (defaulting to equal weighting) that allow the analyst to encode prior knowledge about which evidence domains are most reliable for a given site.

Each SEI component is normalised to [0, 1] by dividing by its within-dataset maximum, making weights directly interpretable as relative importance. The SEI matrix is row-normalised and used both as a regularisation penalty during model fitting and as the basis for the ESE diagnostic.

The Diagonal Gaussian Mixture EM

Phase decomposition proceeds via a Gaussian mixture model fitted by Expectation-Maximisation (EM). The feature matrix for each find is constructed by concatenating: scaled horizontal coordinates \((x, y)\), vertical coordinate \((z)\), temporal midpoint and temporal span of the date range, and a one-hot encoding of the cultural class variable. This multi-domain feature representation ensures that the clustering draws on all available evidence simultaneously.

The algorithm is initialised with K-means, and the resulting hard assignments are converted to softmax probabilities. The EM procedure then iterates between an E-step (computing posterior phase membership probabilities given current parameters) and an M-step (updating mixture weights, means, and diagonal covariance matrices given current memberships). Two modifications distinguish SEF from a standard Gaussian mixture:

  1. Taphonomic weighting: if a taphonomic integrity score is available for each find, it is used to down-weight poorly preserved or disturbed items during parameter estimation, reducing their influence on cluster centroids.
  2. Stratigraphic entanglement penalties: the SEI matrix is incorporated as a soft constraint, encouraging finds with high mutual entanglement to receive similar phase assignments.

Convergence is assessed by monitoring the log-likelihood; the algorithm terminates when the relative change falls below a tolerance threshold or a maximum number of iterations is reached.

Diagnostic Statistics

Three statistics summarise different aspects of the decomposition:

  • Excavation Stratigraphic Energy (ESE): for each find, ESE measures its dissimilarity from its spatial neighbours in terms of phase assignment. A high ESE value indicates that a find is surrounded by items assigned to different phases — a signature of depositional disruption, bioturbation, or post-depositional displacement.

  • Palimpsest Dissolution Index (PDI): a global statistic defined as \(PDI = 1 - \bar{H} / \log(K)\), where \(\bar{H}\) is the mean Shannon entropy of the phase-probability vectors across all finds and \(K\) is the number of phases. PDI ranges from 0 (complete mixing; every find is equally likely to belong to any phase) to 1 (perfect separation; every find is assigned to exactly one phase with certainty). Values above 0.7 generally indicate well-resolved phases.

  • Entropy: the Shannon entropy of each find’s phase-probability vector, \(H_i = -\sum_k p_{ik} \log p_{ik}\). Low entropy means confident assignment; high entropy signals ambiguity, typically at phase boundaries or within heavily mixed zones.

  • Intrusion score: a composite metric combining high entropy, high ESE, and low SEI connectivity. Finds exceeding a threshold on this composite are flagged as candidate intrusions — items that are spatially or stratigraphically anomalous relative to their assigned phase.

The Dataset: Villa Romana

We demonstrate the complete analytical workflow using villa_romana, a real archaeological dataset from Poggio Gramignano (Lugnano in Teverina, TR), a multi-period Roman villa with an annexed Late Antique infant cemetery. The dataset contains 615 inventoried finds from 54 stratigraphic units, with taphonomic scores assigned by the excavator.

library(palimpsestr)
data(villa_romana)
cat("Finds:", nrow(villa_romana), "\n")
#> Finds: 615
cat("Stratigraphic units:", length(unique(villa_romana$context)), "\n")
#> Stratigraphic units: 54
cat("Material classes:", length(unique(villa_romana$class)), "\n")
#> Material classes: 17
cat("Date range:", min(villa_romana$date_min), "to", max(villa_romana$date_max), "\n")
#> Date range: -650 to 550
cat("Mean taf_score:", round(mean(villa_romana$taf_score), 3), "\n")
#> Mean taf_score: 0.731

The site spans eight archaeological periods:

Period Chronology Description
10 650–350 BCE Pre-Roman (foundation pits)
9 30 BCE–80 CE Augustan–Flavian (villa construction)
8 80–137 CE Flavian
7 138–179 CE Antonine–Severan
6 180–324 CE Severan–Late Antique
5 324–425 CE Late Antique
4 425–450 CE Late Antique (cemetery)
3 450–550 CE Late Antique–Early Medieval

The deposit includes levelling fills (US 76, 117, 118), backfills, and redeposited material, reflected in the taphonomic scores (taf_score: 0.5 for disturbed, 0.7 for redeposited, 1.0 for in situ).

cat("Material classes:\n")
#> Material classes:
sort(table(villa_romana$class), decreasing = TRUE)
#> 
#>         Reperto Ceramico        Reperto Anforaceo Materiale Da Costruzione 
#>                      417                       76                       38 
#>           Reperto Vitreo        Reperto Metallico      Reperto Numismatico 
#>                       25                       19                       10 
#>  Reperti Archeozoologici          Reperto Lapideo             Campionature 
#>                        7                        6                        5 
#>     Scarti Di Produzione         Intonaco Dipinto        Materiale Lapideo 
#>                        5                        1                        1 
#>        Pietre Dure-Gemme    Reperti Antropologici       Reperto Faunistico 
#>                        1                        1                        1 
#>  Resti Di Pavimentazioni                   Stucco 
#>                        1                        1
str(villa_romana)
#> 'data.frame':    615 obs. of  9 variables:
#>  $ id       : chr  "VRPG_0001" "VRPG_0002" "VRPG_0003" "VRPG_0004" ...
#>  $ x        : num  2299041 2299041 2299041 2299041 2299041 ...
#>  $ y        : num  4714126 4714126 4714126 4714126 4714126 ...
#>  $ z        : num  284 284 284 284 284 ...
#>  $ context  : chr  "US_102" "US_102" "US_102" "US_102" ...
#>  $ date_min : num  425 425 425 425 425 425 425 425 425 425 ...
#>  $ date_max : num  450 450 450 450 450 450 450 450 450 450 ...
#>  $ class    : chr  "Reperto Ceramico" "Reperto Ceramico" "Reperto Ceramico" "Reperto Ceramico" ...
#>  $ taf_score: num  0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 ...

Phase Count Selection

When the number of phases is unknown, compare_k() fits models across a range of \(K\) values and reports BIC, ICL, and PDI for each. The optimal \(K\) minimises BIC while maximising PDI.

ck <- compare_k(
  villa_romana,
  k_values = 2:7,
  tafonomy = "taf_score",
  context  = "context",
  seed     = 42
)
print(ck)
#>   k       pdi mean_entropy mean_local_sei mean_energy   loglik        bic
#> 1 2 0.9999986 9.931278e-07       1175.378   0.8821964 2479.278  -4618.210
#> 2 3 0.9999894 1.165145e-05       1175.378   0.8821964 3300.199  -6086.667
#> 3 4 0.9999916 1.159321e-05       1175.378   0.8821964 3536.812  -6386.510
#> 4 5 0.9987285 2.046475e-03       1175.378   0.8821964 6654.301 -12448.104
#> 5 6 0.9992440 1.354517e-03       1175.378   0.8821964 6858.413 -12682.944
#> 6 7 0.9987104 2.509385e-03       1175.378   0.8821964 7358.103 -13508.941
#>          icl tot_withinss  pseudo_bic
#> 1  -4618.209    1389.5727  565.520742
#> 2  -6086.653     622.9523  104.225628
#> 3  -6386.496     512.2302   15.980832
#> 4 -12445.587     140.8267 -746.026039
#> 5 -12681.278     453.8948    5.838235
#> 6 -13505.854     441.5804   21.030521
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_compare_k(ck)
}
Model selection diagnostics: BIC, PDI, and mean entropy across candidate phase counts. The dashed line marks the optimal K.

Model selection diagnostics: BIC, PDI, and mean entropy across candidate phase counts. The dashed line marks the optimal K.

Fitting the Model (K = 4)

Based on the selection diagnostics and archaeological knowledge of the site (4 occupation periods), we fit with K = 4 and multiple random initialisations.

fit <- fit_sef(
  villa_romana,
  k        = 4,
  tafonomy = "taf_score",
  context  = "context",
  n_init   = 10,
  seed     = 42
)
print(fit)
#> <sef_fit>
#> Observations: 615
#> Estimated phases: 4
#> Mean entropy: 0.002
#> Mean local SEI: 1175.378
#> Mean energy: 0.882
#> LogLik: 3715.846
#> BIC: -6744.579
#> ICL: -6742.059
#> PDI: 0.999
#> Converged: yes
#> Class model: multinomial
#> Initialisations: 10
summary(fit)
#> $n
#> [1] 615
#> 
#> $k
#> [1] 4
#> 
#> $pdi
#> [1] 0.9985223
#> 
#> $mean_entropy
#> [1] 0.002048532
#> 
#> $mean_energy
#> [1] 0.8821964
#> 
#> $loglik
#> [1] 3715.846
#> 
#> $bic
#> [1] -6744.579
#> 
#> $phase_count
#> 
#>   1   2   3   4 
#> 126  48 121 320

Model Overview

cat("PDI:", pdi(fit), "\n")
#> PDI: 0.9985223
sef_summary(fit)
#> $n
#> [1] 615
#> 
#> $k
#> [1] 4
#> 
#> $pdi
#> [1] 0.9985223
#> 
#> $mean_entropy
#> [1] 0.002048532
#> 
#> $mean_energy
#> [1] 0.8821964
#> 
#> $loglik
#> [1] 3715.846
#> 
#> $bic
#> [1] -6744.579
#> 
#> $phase_count
#> 
#>   1   2   3   4 
#> 126  48 121 320

EM Convergence

if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_convergence(fit)
}
EM convergence trace showing log-likelihood stabilisation across iterations.

EM convergence trace showing log-likelihood stabilisation across iterations.

Spatial Diagnostics

Phase-field Map

if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_phasefield(fit)
}
Dominant phase assignment. Point size reflects assignment confidence (inverse entropy). Larger points indicate higher confidence.

Dominant phase assignment. Point size reflects assignment confidence (inverse entropy). Larger points indicate higher confidence.

The phase-field map reveals the spatial structure of the decomposition. The model assigns each find to one of four chronologically coherent phases despite the presence of post-depositional disturbance.

Entropy Map

if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_entropy(fit)
}
Spatial distribution of Shannon entropy. High-entropy zones (bright) mark areas of uncertain phase assignment, typically at phase boundaries or in heavily mixed deposits.

Spatial distribution of Shannon entropy. High-entropy zones (bright) mark areas of uncertain phase assignment, typically at phase boundaries or in heavily mixed deposits.

Entropy mapping is particularly useful for identifying the spatial extent of mixing. Zones of elevated entropy correspond to areas where multiple depositional phases are interleaved and cannot be cleanly separated.

Energy Map (ESE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_energy(fit)
}
Excavation Stratigraphic Energy field. Elevated ESE values (bright) highlight zones of depositional disruption where neighbouring finds belong to different phases.

Excavation Stratigraphic Energy field. Elevated ESE values (bright) highlight zones of depositional disruption where neighbouring finds belong to different phases.

The energy map highlights local disruptions. Clusters of high-ESE finds may indicate bioturbation channels, pit cuts, or other post-depositional disturbances.

Intrusion Detection Map

if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_intrusions(fit, top_n = 10)
}
Candidate intrusions overlaid on the phase-field map. Circled finds exhibit high entropy, high stratigraphic energy, and low entanglement with their neighbours. Top 10 suspects are labelled.

Candidate intrusions overlaid on the phase-field map. Circled finds exhibit high entropy, high stratigraphic energy, and low entanglement with their neighbours. Top 10 suspects are labelled.

Vertical Phase Profile

if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_phase_profile(fit)
}
Vertical phase profile: depth vs horizontal position. Phase 1 (deepest) corresponds to the earliest occupation; Phase 4 (shallowest) to the most recent.

Vertical phase profile: depth vs horizontal position. Phase 1 (deepest) corresponds to the earliest occupation; Phase 4 (shallowest) to the most recent.

The vertical profile confirms the expected stratigraphic ordering: earlier phases occupy deeper strata, later phases are shallower.

Model-output diagnostic plots

fit <- fit_sef(demo_moderate, k = 3, context = "context", noise = TRUE)
gg_phase_composition(fit)        # per-phase class profile (heatmap)
gg_unit_coherence(fit)           # within-unit coherence
gg_outliers(fit)                 # intrusion ranking (noise posterior)
gg_direction(fit)                # residual vs latent finds

Intrusion Analysis

The detect_intrusions() function computes a composite intrusion probability for each find, combining entropy, ESE, and SEI connectivity into a single score.

di <- detect_intrusions(fit)
suspects <- di[di$intrusion_prob > 0.5, ]
cat("Suspected intrusions:", nrow(suspects), "out of", nrow(villa_romana), "finds",
    sprintf("(%.1f%%)\n", 100 * nrow(suspects) / nrow(villa_romana)))
#> Suspected intrusions: 10 out of 615 finds (1.6%)
if (nrow(suspects) > 0) {
  # Merge with source data and phase assignments
  phase_tab <- as_phase_table(fit)
  susp_data <- merge(suspects, villa_romana[, c("id", "context", "class")], by = "id")
  susp_data <- merge(susp_data, phase_tab[, c("id", "dominant_phase")], by = "id")
  susp_data <- susp_data[order(susp_data$intrusion_prob, decreasing = TRUE), ]
  susp_show <- susp_data[, c("id", "context", "class", "dominant_phase", "intrusion_prob")]
  names(susp_show) <- c("Find", "US", "Class", "Phase", "Intrusion Prob.")
  knitr::kable(head(susp_show, 15), row.names = FALSE, digits = 3,
               caption = "Top suspected intrusions ranked by intrusion probability.")
}
Top suspected intrusions ranked by intrusion probability.
Find US Class Phase Intrusion Prob.
VRPG_0181 US_116 Reperto Numismatico 4 0.621
VRPG_0489 US_276 Reperto Metallico 2 0.593
VRPG_0180 US_116 Reperto Vitreo 4 0.591
VRPG_0431 US_204 Materiale Da Costruzione 2 0.578
VRPG_0417 US_173 Materiale Da Costruzione 3 0.568
VRPG_0155 US_114 Reperto Numismatico 4 0.551
VRPG_0157 US_114 Reperto Numismatico 4 0.551
VRPG_0162 US_114 Reperto Vitreo 4 0.526
VRPG_0173 US_114 Reperto Vitreo 4 0.526
VRPG_0174 US_114 Reperto Vitreo 4 0.526

Flagged finds warrant closer examination: they may represent genuinely intrusive material (e.g., items displaced by animal burrowing or construction cuts), or they may sit at genuine phase boundaries where assignment is inherently ambiguous.

Stratigraphic Unit Purity

The us_summary_table() function aggregates diagnostics per stratigraphic unit, reporting dominant phase, purity, mean entropy, energy, and intrusion count.

us_tab <- us_summary_table(fit)
knitr::kable(us_tab, row.names = FALSE, digits = 3,
             caption = "Per-US diagnostics: dominant phase, purity, mean entropy, mean ESE, and number of intrusions.")
Per-US diagnostics: dominant phase, purity, mean entropy, mean ESE, and number of intrusions.
context n_finds dominant_phase purity mean_entropy mean_energy mean_local_sei n_intrusions
US_102 13 4 1 0.034 0.912 1475.679 0
US_104 93 4 1 0.000 0.771 1454.472 0
US_105 19 4 1 0.000 0.669 1462.312 0
US_108 3 4 1 0.000 0.857 1489.492 0
US_111 25 4 1 0.000 0.967 1296.115 0
US_114 22 4 1 0.015 0.825 1440.291 5
US_116 6 4 1 0.025 1.017 1337.525 2
US_117 27 4 1 0.000 0.910 1339.410 0
US_118 16 1 1 0.010 0.934 1366.511 0
US_133 1 4 1 0.000 1.477 1092.234 0
US_134 41 4 1 0.001 0.766 1459.614 0
US_136 33 1 1 0.000 0.989 1376.467 0
US_137 15 1 1 0.000 1.117 1198.393 0
US_144 18 1 1 0.000 1.119 1236.642 0
US_148 1 4 1 0.000 0.706 1481.237 0
US_149 15 4 1 0.000 0.663 1440.034 0
US_150 48 4 1 0.000 0.887 1407.464 0
US_151 10 1 1 0.000 0.524 1273.794 0
US_157 3 1 1 0.000 1.180 1428.854 0
US_158 1 1 1 0.000 1.182 1076.402 0
US_161 4 1 1 0.003 1.402 1060.050 0
US_166 2 1 1 0.000 1.366 1027.267 0
US_173 1 3 1 0.006 2.229 606.910 1
US_180 1 4 1 0.000 1.474 1066.935 0
US_183 5 4 1 0.000 0.618 1362.369 0
US_189 2 1 1 0.026 1.315 1270.426 0
US_197 3 1 1 0.000 1.675 1041.681 0
US_201 1 1 1 0.003 1.520 1102.207 0
US_204 4 2 1 0.000 1.609 710.525 1
US_209 1 1 1 0.000 1.458 1091.588 0
US_21 2 2 1 0.000 1.000 592.936 0
US_210 41 2 1 0.000 0.283 854.277 0
US_211 1 1 1 0.000 1.665 1044.096 0
US_215 1 1 1 0.001 1.699 1034.543 0
US_220 1 1 1 0.000 1.696 998.151 0
US_258 8 1 1 0.002 1.591 1061.888 0
US_276 1 2 1 0.000 2.289 368.350 1
US_287 25 3 1 0.000 0.942 559.458 0
US_296 10 3 1 0.000 0.969 604.886 0
US_297 17 3 1 0.000 0.848 654.842 0
US_299 1 3 1 0.000 1.008 781.722 0
US_302 2 3 1 0.000 1.000 60.161 0
US_303 1 3 1 0.000 1.007 781.464 0
US_304 3 3 1 0.000 1.378 383.272 0
US_306 5 3 1 0.000 1.026 497.678 0
US_307 30 3 1 0.000 0.913 579.065 0
US_308 4 3 1 0.000 1.032 523.701 0
US_315 2 3 1 0.000 1.132 642.667 0
US_316 2 3 1 0.000 1.194 615.690 0
US_319 6 3 1 0.000 0.967 720.365 0
US_324 1 3 1 0.000 1.370 413.377 0
US_72 5 1 1 0.014 1.499 1204.926 0
US_76 11 3 1 0.000 1.001 618.110 0
US_89 1 1 1 0.000 1.526 1010.339 0

Units with purity > 90% indicate consistent phase assignment within the context. Lower purity and higher mean entropy correspond to disturbed contexts where bioturbation or construction cuts have introduced material from adjacent phases.

n_pure <- sum(us_tab$purity >= 0.9)
cat(n_pure, "out of", nrow(us_tab), "units have purity >= 90%\n")
#> 54 out of 54 units have purity >= 90%

Phase Transition Matrix

Shows how often finds from phase \(i\) sit directly above finds from phase \(j\), revealing spatial adjacency between phases:

tm <- phase_transition_matrix(fit)
print(tm)
#>        phase1 phase2 phase3 phase4
#> phase1    119      2      0      5
#> phase2      1     44      2      1
#> phase3      1      2    118      0
#> phase4      5      0      0    314

High off-diagonal values indicate zones of contact between phases. This is expected at the boundary between successive occupation periods.

Model Validation

When true phase assignments are known (e.g., from simulation), palimpsestr provides validation metrics. We demonstrate with the simulated demo_moderate dataset:

data(demo_moderate)
fit_val <- fit_sef(demo_moderate, k = 3, tafonomy = "taf_score", context = "context", seed = 42)
ari <- adjusted_rand_index(fit_val, demo_moderate$true_phase)
cat("Adjusted Rand Index:", round(ari, 3), "\n")
#> Adjusted Rand Index: 1
confusion_matrix(fit_val, demo_moderate$true_phase)
#>          true
#> estimated  2  1  3
#>         1 67  0  0
#>         2  0 67  0
#>         3  0  0 66
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_confusion(fit_val, demo_moderate$true_phase)
}
Confusion matrix heatmap on simulated data. Strong diagonal = correct phase recovery.

Confusion matrix heatmap on simulated data. Strong diagonal = correct phase recovery.

For the real Poggio Gramignano data, no ground truth is available. Validation relies on archaeological coherence: the model should recover phases consistent with the site periodisation, and flag known disturbed contexts (levelling fills, backfills) as mixed.

Sensitivity Analysis of Domain Weights

The model combines four evidence domains (spatial, vertical, temporal, cultural) with configurable weights c(ws, wz, wt, wc). Since version 0.14.0 these weights scale the corresponding feature dimensions and therefore enter the mixture likelihood directly (in earlier versions they affected only the SEI matrix). To assess how weight choices affect the results, we compare three configurations:

configs <- list(
  equal    = c(ws = 1, wz = 1, wt = 1, wc = 1),
  spatial  = c(ws = 2, wz = 1, wt = 0.5, wc = 0.5),
  temporal = c(ws = 0.5, wz = 0.5, wt = 2, wc = 1)
)

sens <- do.call(rbind, lapply(names(configs), function(nm) {
  w <- configs[[nm]]
  f <- fit_sef(villa_romana, k = 4, weights = w, seed = 42,
               tafonomy = "taf_score", context = "context")
  di <- detect_intrusions(f)
  data.frame(
    config = nm,
    pdi = pdi(f),
    mean_entropy = mean(f$entropy, na.rm = TRUE),
    n_intrusions = sum(di$intrusion_prob > 0.5),
    stringsAsFactors = FALSE
  )
}))
knitr::kable(sens, digits = 4,
             caption = "Sensitivity of PDI, mean entropy, and intrusion count to weight configuration.")
Sensitivity of PDI, mean entropy, and intrusion count to weight configuration.
config pdi mean_entropy n_intrusions
equal 1.0000 0.0000 3
spatial 0.9985 0.0020 7
temporal 0.9985 0.0021 10

Because the weights now reshape the feature space, they genuinely change the recovered phases — emphasising the most informative domain for a given site improves separability, while down-weighting a noisy domain reduces spurious structure. The optimize_weights() function provides a data-driven approach to weight selection via cross-validated held-out log-likelihood; it fits under var_structure = "spherical" (the structure under which the weights are statistically identifiable) and corrects for the Jacobian of the weighting so that configurations are directly comparable. The returned weights are intended to be used with fit_sef(..., var_structure = "spherical").

Bootstrap Confidence Intervals

Uncertainty on key statistics can be quantified via bootstrap resampling:

bs <- bootstrap_sef(fit, n_boot = 50, verbose = FALSE)
knitr::kable(bs, digits = 4,
             caption = "Bootstrap confidence intervals (50 replicates) for key model statistics.")
Bootstrap confidence intervals (50 replicates) for key model statistics.
statistic estimate lower upper se
pdi 0.9985 0.9975 1.0000 0.0007
mean_entropy 0.0020 0.0000 0.0034 0.0010
mean_energy 0.8822 0.8156 0.9239 0.0296
loglik 3715.8461 3455.0765 6518.5120 981.1524
if (requireNamespace("ggplot2", quietly = TRUE)) {
  gg_bootstrap(bs)
}
Bootstrap confidence intervals for PDI, mean entropy, mean energy, log-likelihood, and ARI.

Bootstrap confidence intervals for PDI, mean entropy, mean energy, log-likelihood, and ARI.

Harris Matrix Integration

palimpsestr can incorporate stratigraphic information from a Harris Matrix as a penalty during model fitting. The harris_from_contexts() function auto-generates a penalty matrix from the mean depth of finds in each context. Alternatively, read_harris() imports an external Harris Matrix from a CSV edge list.

H <- harris_from_contexts(villa_romana, z_col = "z", context_col = "context")
cat("Harris penalty matrix:", nrow(H), "x", ncol(H), "\n")
#> Harris penalty matrix: 615 x 615
fit_h <- fit_sef(villa_romana, k = 4, harris = H,
                 tafonomy = "taf_score", context = "context",
                 n_init = 5, seed = 42)
cat("PDI without Harris:", round(pdi(fit), 4), "\n")
#> PDI without Harris: 0.9985
cat("PDI with Harris:   ", round(pdi(fit_h), 4), "\n")
#> PDI with Harris:    1
di_no <- detect_intrusions(fit)
di_h  <- detect_intrusions(fit_h)
cat("Intrusions without Harris:", sum(di_no$intrusion_prob > 0.5), "\n")
#> Intrusions without Harris: 10
cat("Intrusions with Harris:   ", sum(di_h$intrusion_prob > 0.5), "\n")
#> Intrusions with Harris:    3

Validate phases against stratigraphy

validate_phases_harris(fit)
#>     upper  lower upper_phase lower_phase consistent
#> 1  US_302 US_303           3           3       TRUE
#> 2  US_303 US_299           3           3       TRUE
#> 3  US_299 US_304           3           3       TRUE
#> 4  US_304 US_296           3           3       TRUE
#> 5  US_296 US_297           3           3       TRUE
#> 6  US_297 US_308           3           3       TRUE
#> 7  US_308 US_287           3           3       TRUE
#> 8  US_287 US_306           3           3       TRUE
#> 9  US_306 US_316           3           3       TRUE
#> 10 US_316 US_307           3           3       TRUE
#> 11 US_307 US_324           3           3       TRUE
#> 12 US_324 US_315           3           3       TRUE
#> 13 US_315  US_76           3           3       TRUE
#> 14  US_76 US_204           3           2      FALSE
#> 15 US_204 US_319           2           3       TRUE
#> 16 US_319 US_276           3           2      FALSE
#> 17 US_276 US_151           2           1      FALSE
#> 18 US_151 US_210           1           2       TRUE
#> 19 US_210 US_173           2           3       TRUE
#> 20 US_173 US_136           3           1      FALSE
#> 21 US_136 US_166           1           1       TRUE
#> 22 US_166 US_102           1           4       TRUE
#> 23 US_102 US_137           4           1      FALSE
#> 24 US_137 US_108           1           4       TRUE
#> 25 US_108 US_118           4           1      FALSE
#> 26 US_118 US_144           1           1       TRUE
#> 27 US_144  US_89           1           1       TRUE
#> 28  US_89 US_161           1           1       TRUE
#> 29 US_161 US_158           1           1       TRUE
#> 30 US_158 US_157           1           1       TRUE
#> 31 US_157 US_114           1           4       TRUE
#> 32 US_114 US_104           4           4       TRUE
#> 33 US_104 US_116           4           4       TRUE
#> 34 US_116 US_209           4           1      FALSE
#> 35 US_209 US_211           1           1       TRUE
#> 36 US_211 US_201           1           1       TRUE
#> 37 US_201  US_21           1           2       TRUE
#> 38  US_21 US_134           2           4       TRUE
#> 39 US_134 US_189           4           1      FALSE
#> 40 US_189 US_197           1           1       TRUE
#> 41 US_197 US_215           1           1       TRUE
#> 42 US_215 US_220           1           1       TRUE
#> 43 US_220 US_258           1           1       TRUE
#> 44 US_258 US_180           1           4       TRUE
#> 45 US_180 US_105           4           4       TRUE
#> 46 US_105 US_133           4           4       TRUE
#> 47 US_133 US_149           4           4       TRUE
#> 48 US_149 US_148           4           4       TRUE
#> 49 US_148  US_72           4           1      FALSE
#> 50  US_72 US_111           1           4       TRUE
#> 51 US_111 US_150           4           4       TRUE
#> 52 US_150 US_117           4           4       TRUE
#> 53 US_117 US_183           4           4       TRUE

The consistent column flags whether the dominant phase ordering respects stratigraphic depth. FALSE indicates an inversion that may warrant investigation.

Interpretive Report

The report_sef() function generates a plain-language summary of the analysis, suitable for inclusion in excavation reports. It is available in English and Italian.

report_sef(fit, lang = "en")
#> # SEF Analysis Report
#> 
#> ## Overview
#> 
#> The Stratigraphic Entanglement Field model with **K = 4 phases** was fitted to a dataset of **615 finds** distributed across **54 stratigraphic contexts**.
#> 
#> The Palimpsest Dissolution Index (PDI) is **0.999**, indicating **excellent phase separation**. Mean entropy is 0.0020 and mean ESE is 0.882.
#> 
#> Model diagnostics: LogLik = 3715.8, BIC = -6744.6.
#> 
#> ## Phase Composition
#> 
#> ### Phase 1 (126 finds, 20.5%)
#> 
#> - **Dating**: average 438 CE (range: 425 to 450)
#> - **Material classes**: Materiale Da Costruzione (7), Reperto Anforaceo (26), Reperto Ceramico (68), Reperto Lapideo (1), Reperto Metallico (2), Reperto Numismatico (7), Reperto Vitreo (15)
#> - **Mean entropy**: 0.0025; **Mean energy**: 1.113
#> - **Contexts**: US_118, US_136, US_137, US_144, US_151, US_157, ... (19 total)
#> - **Assessment**: Entropy is very low, suggesting this phase is internally coherent and well-defined.
#> 
#> ### Phase 2 (48 finds, 7.8%)
#> 
#> - **Dating**: average 374 BCE (range: -650 to 550)
#> - **Material classes**: Materiale Da Costruzione (3), Reperti Archeozoologici (1), Reperto Anforaceo (1), Reperto Ceramico (42), Reperto Metallico (1)
#> - **Mean entropy**: 0.0000; **Mean energy**: 0.466
#> - **Contexts**: US_204, US_21, US_210, US_276
#> - **Assessment**: Entropy is very low, suggesting this phase is internally coherent and well-defined.
#> 
#> ### Phase 3 (121 finds, 19.7%)
#> 
#> - **Dating**: average 372 CE (range: 180 to 425)
#> - **Material classes**: Campionature (5), Intonaco Dipinto (1), Materiale Da Costruzione (15), Materiale Lapideo (1), Pietre Dure-Gemme (1), Reperti Antropologici (1), Reperti Archeozoologici (6), Reperto Anforaceo (12), Reperto Ceramico (47), Reperto Faunistico (1), Reperto Lapideo (5), Reperto Metallico (16), Reperto Vitreo (3), Resti Di Pavimentazioni (1), Scarti Di Produzione (5), Stucco (1)
#> - **Mean entropy**: 0.0001; **Mean energy**: 0.971
#> - **Contexts**: US_173, US_287, US_296, US_297, US_299, US_302, ... (16 total)
#> - **Assessment**: Entropy is very low, suggesting this phase is internally coherent and well-defined.
#> 
#> ### Phase 4 (320 finds, 52.0%)
#> 
#> - **Dating**: average 438 CE (range: 425 to 450)
#> - **Material classes**: Materiale Da Costruzione (13), Reperto Anforaceo (37), Reperto Ceramico (260), Reperto Numismatico (3), Reperto Vitreo (7)
#> - **Mean entropy**: 0.0029; **Mean energy**: 0.820
#> - **Contexts**: US_102, US_104, US_105, US_108, US_111, US_114, ... (15 total)
#> - **Assessment**: Entropy is very low, suggesting this phase is internally coherent and well-defined.
#> 
#> ## Intrusion Detection
#> 
#> **10 finds** (1.6%) have an intrusion probability above 0.5, flagging them as potentially displaced or redeposited.
#> 
#> Top suspects:
#> 
#> | ID | Context | Intrusion Prob. |
#> |---|---|---|
#> | VRPG_0181 | US_116 | 0.621 |
#> | VRPG_0489 | US_276 | 0.593 |
#> | VRPG_0180 | US_116 | 0.591 |
#> | VRPG_0431 | US_204 | 0.578 |
#> | VRPG_0417 | US_173 | 0.568 |
#> | VRPG_0155 | US_114 | 0.551 |
#> | VRPG_0157 | US_114 | 0.551 |
#> | VRPG_0162 | US_114 | 0.526 |
#> | VRPG_0173 | US_114 | 0.526 |
#> | VRPG_0174 | US_114 | 0.526 |
#> 
#> ## Stratigraphic Unit Purity
#> 
#> Of **54 stratigraphic units**: **54 pure** (>90%), **0 mixed** (70-90%), **0 palimpsest** (<70%).
#> 
#> 
#> ## Recommendations
#> 
#> The deposit shows good phase separation. The model can be used with confidence for phase attribution. Focus review on flagged intrusions and boundary zones.
#> 
#> ---
#> *Report generated by palimpsestr 0.24.0*
report_sef(fit, lang = "it")
#> # Rapporto Analisi SEF
#> 
#> ## Panoramica
#> 
#> Il modello Stratigraphic Entanglement Field con **K = 4 fasi** e' stato applicato a un dataset di **615 reperti** distribuiti in **54 contesti stratigrafici**.
#> 
#> Il Palimpsest Dissolution Index (PDI) e' **0.999**, indicando una separazione **eccellente** delle fasi. Entropia media: 0.0020. Energia media: 0.882.
#> 
#> Diagnostiche: LogLik = 3715.8, BIC = -6744.6.
#> 
#> ## Composizione delle Fasi
#> 
#> ### Fase 1 (126 reperti, 20.5%)
#> 
#> - **Datazione media**: 438 d.C.
#> - **Classi materiali**: Materiale Da Costruzione (7), Reperto Anforaceo (26), Reperto Ceramico (68), Reperto Lapideo (1), Reperto Metallico (2), Reperto Numismatico (7), Reperto Vitreo (15)
#> - **Entropia media**: 0.0025; **Energia media**: 1.113
#> - **US**: US_118, US_136, US_137, US_144, US_151, US_157, ... (19 totali)
#> 
#> ### Fase 2 (48 reperti, 7.8%)
#> 
#> - **Datazione media**: 374 a.C.
#> - **Classi materiali**: Materiale Da Costruzione (3), Reperti Archeozoologici (1), Reperto Anforaceo (1), Reperto Ceramico (42), Reperto Metallico (1)
#> - **Entropia media**: 0.0000; **Energia media**: 0.466
#> - **US**: US_204, US_21, US_210, US_276
#> 
#> ### Fase 3 (121 reperti, 19.7%)
#> 
#> - **Datazione media**: 372 d.C.
#> - **Classi materiali**: Campionature (5), Intonaco Dipinto (1), Materiale Da Costruzione (15), Materiale Lapideo (1), Pietre Dure-Gemme (1), Reperti Antropologici (1), Reperti Archeozoologici (6), Reperto Anforaceo (12), Reperto Ceramico (47), Reperto Faunistico (1), Reperto Lapideo (5), Reperto Metallico (16), Reperto Vitreo (3), Resti Di Pavimentazioni (1), Scarti Di Produzione (5), Stucco (1)
#> - **Entropia media**: 0.0001; **Energia media**: 0.971
#> - **US**: US_173, US_287, US_296, US_297, US_299, US_302, ... (16 totali)
#> 
#> ### Fase 4 (320 reperti, 52.0%)
#> 
#> - **Datazione media**: 438 d.C.
#> - **Classi materiali**: Materiale Da Costruzione (13), Reperto Anforaceo (37), Reperto Ceramico (260), Reperto Numismatico (3), Reperto Vitreo (7)
#> - **Entropia media**: 0.0029; **Energia media**: 0.820
#> - **US**: US_102, US_104, US_105, US_108, US_111, US_114, ... (15 totali)
#> 
#> ## Intrusioni
#> 
#> **10 reperti** (1.6%) presentano probabilita' di intrusione superiore a 0.5.
#> 
#> ## Purezza delle US
#> 
#> Su **54 US**: **54 pure** (>90%), **0 miste** (70-90%), **0 palinsesto** (<70%).
#> 
#> ---
#> *Rapporto generato da palimpsestr 0.24.0*

Structured Exports

All results can be exported to CSV for integration with external databases and GIS:

export_results(fit, dir = "results/", format = "csv", prefix = "villa_romana")

This creates 4 files:

  • villa_romana_phases.csv — phase assignments and diagnostics per find
  • villa_romana_intrusions.csv — intrusion probabilities
  • villa_romana_us_summary.csv — per-US aggregated statistics
  • villa_romana_model_summary.csv — model-level metrics (n, k, PDI, BIC, etc.)

GIS Export

For integration with GIS workflows, palimpsestr can export results as sf spatial objects:

if (requireNamespace("sf", quietly = TRUE)) {
  sf_pts   <- as_sf_phase(fit, crs = 32632L, dims = "XYZ")
  sf_links <- as_sf_links(fit, quantile_threshold = 0.90, crs = 32632L)
  sf::st_write(sf_pts,   "villa_romana.gpkg", layer = "phases")
  sf::st_write(sf_links, "villa_romana.gpkg", layer = "sei_links")
}

Interactive Visualization

When plotly is available, any gg_* plot can be converted to an interactive visualization using as_plotly():

as_plotly(gg_phasefield(fit))
as_plotly(gg_intrusions(fit))

Comparison with Traditional Approaches

Aspect Harris Matrix k-means palimpsestr (SEF)
Evidence domains Stratigraphy only Spatial only Spatial + vertical + temporal + cultural
Assignment type Deterministic Hard Probabilistic (soft)
Mixing detection No No Yes (entropy, ESE)
Intrusion detection Manual No Automatic
Scalability Manual Good Good
Interpretability High Low High (PDI, reports)

The SEF framework does not replace the Harris Matrix. Rather, it extends it. The Harris Matrix remains indispensable for documenting observed physical relationships between depositional units. What palimpsestr adds is a quantitative, probabilistic layer of analysis that operates on find-level data within those units — precisely the scale at which palimpsest formation is most problematic.

Setup recommendations (v0.13.0)

Real-world archaeological datasets often have subtle structural properties that strongly affect which fit_sef() flags should be enabled — for example, coordinates assigned at the unit centroid rather than per find, chronology inherited from the unit’s archaeological period, highly imbalanced class distributions, or informative taphonomic scores. The recommend_setup() helper inspects a dataset and returns the suggested flags plus diagnostic warnings:

rec <- recommend_setup(villa_romana, context = "context", tafonomy = "taf_score")
print(rec)
#> <sef_setup>
#> 
#> Diagnostics:
#>   Number of classes:        17
#>   Singleton classes:        7
#>   Largest class proportion: 67.8%
#>   Coords at US centroid:    TRUE
#>   Chronology US-tied:       TRUE
#>   Taf informative (range >= 0.2): TRUE (range = 0.70)
#>   Tspan variability (CV):   1.21
#> 
#> Recommended fit_sef() flags:
#>   class_scale:       TRUE
#>   taf_as_feature:    TRUE
#>   chrono_precision:  TRUE
#>   residuality:       FALSE
#> 
#> Warnings:
#>   - Coordinates appear to be at US-centroid level (53 unique XY, 54 unique units): spatial discrimination within a unit is not possible.
#>   - Chronology is US-tied (all finds in a unit share identical date_min/date_max): detect_intrusions() direction will be 'in_context' for all rows. Per-find typological dating is required for directional diagnostics.
#> 
#> Suggested call:
#> fit_sef(
#>   data = your_data,
#>   k = 4,
#>   context = "context",
#>   tafonomy = "taf_score",
#>   class_scale = TRUE,
#>   taf_as_feature = TRUE,
#>   chrono_precision = TRUE
#> )

For villa_romana the helper detects that (i) coordinates are at unit centroid, (ii) chronology is US-tied (every find inherits its unit’s period), (iii) taphonomic scores are informative (range 0.30–1.00), and that the empirically optimal setup is class_scale = TRUE plus taf_as_feature = TRUE. The same call also issues an explicit warning that detect_intrusions() direction will be in_context for every row on this dataset — directional diagnostics require per-find typological dating, not unit-level chronology.

Directional intrusion diagnostics (v0.13.0)

fit_vr <- fit_sef(villa_romana, k = 4, context = "context",
                  tafonomy = "taf_score",
                  class_scale = TRUE, taf_as_feature = TRUE)
intr <- detect_intrusions(fit_vr)
table(intr$direction, useNA = "ifany")
#> 
#>   older_than_context           in_context younger_than_context 
#>                    0                  600                    0 
#>                 <NA> 
#>                   15
head(intr[intr$direction == "younger_than_context", ])
#>        id intrusion_prob direction chrono_gap intrusion_type
#> NA   <NA>             NA      <NA>         NA           <NA>
#> NA.1 <NA>             NA      <NA>         NA           <NA>
#> NA.2 <NA>             NA      <NA>         NA           <NA>
#> NA.3 <NA>             NA      <NA>         NA           <NA>
#> NA.4 <NA>             NA      <NA>         NA           <NA>
#> NA.5 <NA>             NA      <NA>         NA           <NA>

The direction column distinguishes older-than-context finds (residuals redeposited from earlier deposits) from younger-than-context finds (latent intrusions, often signalling unrecognised cuts and refills). The chrono_gap column gives the signed mismatch in years.

Type longevity (v0.13.0)

tl <- type_longevity(fit_vr)
head(tl[order(-tl$longevity_span), c("class", "longevity_min", "longevity_max", "longevity_span", "n_finds")])
#>                      class longevity_min longevity_max longevity_span n_finds
#> 1         Reperto Ceramico          -650           550           1200     417
#> 7  Reperti Archeozoologici          -650           550           1200       7
#> 2        Reperto Anforaceo           180           450            270      76
#> 3 Materiale Da Costruzione           180           450            270      38
#> 4           Reperto Vitreo           180           450            270      25
#> 6        Reperto Metallico           180           450            270      19
if (requireNamespace("ggplot2", quietly = TRUE)) gg_longevity(tl, fit_vr)
Posterior-weighted longevity envelope per cultural class.

Posterior-weighted longevity envelope per cultural class.

Long-lived classes (e.g. tegula, dolium) span multiple phases with non-negligible posterior weight; short-lived classes are confined to a single phase.

Calibrated radiocarbon and OxCal import (v0.13.0, v0.19.0)

cal <- rcarbon::calibrate(x = c(2500, 2400, 2350), errors = c(30, 30, 30), verbose = FALSE)
chrono_df <- chronology_from_rcarbon(cal, method = "hpd")
chrono_df
#>      id date_min date_max date_mid
#> 1 cal_1     -775     -519   -647.0
#> 2 cal_2     -730     -398   -564.0
#> 3 cal_3     -535     -380   -457.5
# Merge into your main data:
# my_data <- merge(my_data, chrono_df, by = "id")
# fit_sef(my_data, ...)

The adapter takes calibrated probability densities from rcarbon::calibrate() and returns the chronological-interval columns (date_min, date_max, date_mid) used by fit_sef(). Three reduction methods are available: HPD (default), median+IQR, and weighted mean +/- 2 SD.

Since v0.19.0, chronology_from_oxcal() provides the same service for OxCal-modelled dates. It accepts either an oxcAAR::oxcalCalibrate() result (reduced by the same three methods) or a generic data.frame of calibrated start/end ranges exported from OxCal:

ranges <- data.frame(id = c("a", "b"), start = c(-700, -50), end = c(-600, 100))
chronology_from_oxcal(ranges)
#>   id date_min date_max date_mid
#> 1  a     -700     -600     -650
#> 2  b      -50      100       25

Model Limitations and Methodological Notes

Diagonal Covariance Assumption

The EM uses diagonal Gaussians, assuming conditional independence between features within each phase. This is a deliberate parsimony choice: for \(p\) features and \(k\) phases, diagonal covariance requires \(k(2p+1)-1\) parameters versus \(k(p(p+3)/2+1)-1\) for full covariance. With typical archaeological datasets (n = 50–500, p = 7–10), full covariance risks overfitting.

SEI Component Normalisation

Each SEI component is normalised to [0, 1] by dividing by its within-dataset maximum. This makes weights interpretable but means absolute SEI values are not comparable across datasets. Cross-site comparisons should use derived statistics (PDI, mean entropy) which are scale-invariant.

PDI Interpretation

The PDI is a descriptive measure, not a formal test statistic. The interpretive guideline (PDI > 0.7 = well-resolved) is empirically derived from simulation studies and should not be treated as a formal decision boundary. Bootstrap confidence intervals provide uncertainty quantification.

Intrusion Detection

The intrusion probability score is a heuristic combining rescaled entropy, ESE, and inverse local SEI. The 0.5 threshold is an exploratory guideline. Users should examine flagged observations individually using domain knowledge.

Phase Labelling

Phase numbers are arbitrary in mixture models (label switching). Use reorder_phases() to ensure phase 1 = deepest (oldest) stratum. The bootstrap procedure applies phase reordering to each replicate.

Cross-Validation

The cv_sef() and optimize_weights() functions standardise test data using the training set’s center and scale, ensuring consistent feature scales across folds.

Scalability

The SEI and ESE matrices are \(O(n^2)\) in memory. For datasets exceeding ~2000 finds, use sei_sparse() or the max_dist parameter in sei_matrix() to limit computation to spatial neighbours.

Interactive Shiny Dashboard

palimpsestr includes a built-in Shiny application for interactive analysis. The dashboard supports CSV upload, SQLite and PostgreSQL connections, and provides all diagnostic plots, tables, and maps in a point-and-click interface.

launch_app()

The app has six tabs: Data Input (CSV/SQLite/PostgreSQL with column mapping), Analysis (parameter selection and model fitting), Results (interactive plots via plotly), Tables (phase assignments, intrusions, US purity), Maps (overlay on excavation plan geometries), and Report (interpretive report generation and CSV/ZIP download).

Advanced Feature Space Options

Version 0.12.0 introduces five optional improvements to the feature space that address limitations observed in real-data testing (Poggio Gramignano villa, Solarolo terramare).

Chronological Precision Weighting

Finds with precise dating (narrow date range) should contribute more to temporal clustering than finds with generic dating. When chrono_precision = TRUE, the feature 1/tspan is added, giving higher weight to precisely dated finds.

# Finds dated "BM2b" (50-year range) weight more than "BM" (300-year range)
fit <- fit_sef(data, k = 3, chrono_precision = TRUE)

Residuality Detection

When a find’s dating is inconsistent with its stratigraphic context (e.g., Late Antique ceramic in a Pre-Roman pit fill), the residuality feature captures this mismatch. Requires context.

# Adds abs(tmid_find - tmid_context_mean) / max_range as a feature
fit <- fit_sef(data, k = 4, context = "context", residuality = TRUE)

Class Scaling

By default, one-hot encoded class columns (0/1) are not scaled, causing them to dominate the clustering when many classes are present. class_scale = TRUE multiplies one-hot columns by \(1/\sqrt{n_{classes}}\), putting their total energy on par with a single numeric feature.

fit <- fit_sef(data, k = 4, class_scale = TRUE)

Taphonomic Score as Feature

In addition to using taf_score as a weight in the EM algorithm, it can be added as an explicit feature dimension, so that finds with similar taphonomic conditions cluster together.

fit <- fit_sef(data, k = 4, tafonomy = "taf_score", taf_as_feature = TRUE)

Sub-class Encoding

When finer typological distinctions are available (e.g., “ceramic_impasto” vs “ceramic_sigillata” instead of just “ceramic”), the subclass parameter uses them for one-hot encoding without changing the main class column.

fit <- fit_sef(data, k = 4, subclass = "subtype")

Using All Five Together

fit <- fit_sef(data, k = 4,
               tafonomy = "taf_score", context = "context",
               chrono_precision = TRUE,
               taf_as_feature = TRUE,
               residuality = TRUE,
               class_scale = TRUE,
               subclass = "subtype")

Conclusion

The palimpsestr package offers a unified probabilistic framework for decomposing archaeological palimpsests, addressing a long-standing gap between qualitative stratigraphic reasoning and quantitative spatial analysis. By jointly modelling spatial proximity, vertical distribution, chronological overlap, and cultural similarity, the Stratigraphic Entanglement Field approach produces soft phase assignments that acknowledge — rather than suppress — the inherent uncertainty of mixed deposits. The three diagnostic statistics (SEI, ESE, PDI) provide interpretable, actionable summaries at the find, deposit, and assemblage levels respectively.

The package is available on GitHub at https://github.com/enzococca/palimpsestr.

# install.packages("remotes")
remotes::install_github("enzococca/palimpsestr")