Dialogical Chatbot
Frontier Psychology research prototype

3) Safety-first

Safety is a first-class concern: layered recognition, clear escalation gates, human oversight, and transparent boundaries. The assistant uses language to notice overload, risk, and readiness for reflection, then changes its stance accordingly rather than treating safety as a single on/off switch.

Escalation gates (examples)

  • Direct self-harm language or intent signals -> immediate human route and crisis info.
  • Clear acute danger wording -> stop reflection and move to immediate safety signposting.
  • Repeated high-risk turns within a session -> contain, ground, end gently, and notify a human reviewer.
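
A minimal sketch of how gates like these could be expressed in code. The signal labels (self_harm_intent, acute_danger), the threshold, and the action names are illustrative assumptions; the real recognisers would sit upstream of this function.

    from dataclasses import dataclass, field

    @dataclass
    class Turn:
        text: str
        risk_signals: set = field(default_factory=set)  # e.g. {"self_harm_intent"}

    def gate(turn: Turn, high_risk_count: int) -> str:
        """Map recognised signals to an escalation action (hypothetical labels)."""
        if "self_harm_intent" in turn.risk_signals:
            return "route_to_human_and_crisis_info"
        if "acute_danger" in turn.risk_signals:
            return "stop_reflection_signpost_safety"
        if high_risk_count >= 3:  # repeated high-risk turns within one session
            return "contain_ground_end_notify_reviewer"
        return "continue"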

Oversight and audit

  • Supervision notes linked to anonymised transcripts.
  • Automatic logging of escalation decisions and handovers.
  • Data minimisation and purpose limitation by default.

Safety as relational containment, not surveillance

Safety here is enacted through relational containment: how the assistant slows, clarifies, redirects, or closes a thread when language suggests strain. It reads for intensity and direction, not diagnosis. Containment comes before interpretation.

  • Shorter replies and fewer questions when arousal rises.
  • Clear choices rather than open prompts.
  • Explicit permission to pause, stop, or change topic.
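
As a sketch, containment can be treated as response shaping. The arousal threshold and the sentence cut below are illustrative assumptions, not a fixed specification.

    def shape_reply(draft: str, arousal: float) -> str:
        """Shorten the reply and offer clear choices as arousal rises."""
        if arousal < 0.5:  # hypothetical threshold for elevated arousal
            return draft
        # Fewer sentences and fewer open questions reduce cognitive load.
        sentences = [s.strip() for s in draft.split(".") if s.strip()]
        short = ". ".join(sentences[:2]) + "."
        # Clear choices, with explicit permission to pause or stop.
        return short + " Would you like to pause, change topic, or keep going?"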

Safety is multi-layered, not one mechanism

In this project, safety is not only about detecting crisis language. It is a multi-layered concept: reading the language for signs of overload or danger, deciding whether the conversation needs containment or escalation, and also recognising when the person may be steady enough for careful reflective work.

  • Layer 1: recognise distress, agitation, shutdown, guilt themes, or direct risk wording in the dialogue.
  • Layer 2: choose the right stance (containment, support-offer, crisis routing, or bounded reflection).
  • Layer 3: bring in human oversight and real-world support when the assistant should not carry the interaction alone.
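
A compact sketch of the three layers wired together. Every function, cue, and label here is a hypothetical stub standing in for the real components.

    def recognise(text: str) -> set:
        """Layer 1: surface signals from the dialogue (stub keyword checks)."""
        signals = set()
        lower = text.lower()
        if "can't go on" in lower:
            signals.add("direct_risk")
        if "my fault" in lower:
            signals.add("guilt_theme")
        if text.isupper() and len(text) > 10:  # crude agitation proxy
            signals.add("agitation")
        return signals

    def choose_stance(signals: set) -> str:
        """Layer 2: map signals to a stance."""
        if "direct_risk" in signals:
            return "crisis_routing"
        if "agitation" in signals:
            return "containment"
        if "guilt_theme" in signals:
            return "support_offer"
        return "bounded_reflection"

    def process_turn(text: str, notify_reviewer) -> str:
        stance = choose_stance(recognise(text))
        if stance == "crisis_routing":
            notify_reviewer(stance)  # Layer 3: bring in human oversight
        return stance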

This means the same monitoring logic can do two jobs: trigger escalation when risk rises, and encourage reflection when the language suggests that reflection is appropriate rather than destabilising.

Language recognition now, neural models later

The immediate version can work through relatively transparent language recognition: keywords, repeated phrases, moral injury themes, direct safety language, and shifts in tone or coherence. That is enough to support early containment and escalation logic.
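For illustration, a first transparent pass might look like the sketch below. The cue lists are illustrative stand-ins, not clinical or validated vocabularies.

    from collections import Counter

    MORAL_INJURY_CUES = ("my fault", "should have saved", "unforgivable")  # illustrative
    DIRECT_SAFETY_CUES = ("hurt myself", "end it all")                     # illustrative

    def read_language(text: str, recent_turns: list) -> dict:
        """Transparent first pass: keywords, safety language, repetition."""
        lower = text.lower()
        counts = Counter(t.lower().strip() for t in recent_turns[-10:])
        return {
            "moral_injury": any(c in lower for c in MORAL_INJURY_CUES),
            "direct_safety": any(c in lower for c in DIRECT_SAFETY_CUES),
            "looping": any(n >= 3 for n in counts.values()),  # repeated phrases
        }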

Longer term, neural networks or sequence-aware models could make this richer by noticing patterns across time rather than single turns: loops, changes in coherence, persistent blame themes, or signs that the person is moving from dysregulation toward reflective capacity.

Even then, the purpose stays the same. These models are not there to diagnose the user. They are there to improve timing: when to slow down, when to escalate, and when it may be safe and useful to invite reflection.
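One conceivable shape for that timing logic, shown here without any neural model: a per-turn dysregulation score (assumed to come from a future learned scorer) feeds a rolling window, and trends across the window, not single turns, drive behaviour.

    from collections import deque

    class TimingMonitor:
        """Track a rolling dysregulation score; trends drive timing decisions."""

        def __init__(self, window: int = 8):
            self.scores = deque(maxlen=window)  # recent per-turn scores

        def update(self, dysregulation: float) -> str:
            self.scores.append(dysregulation)
            if len(self.scores) < self.scores.maxlen:
                return "observe"  # not enough history to read a trend yet
            drift = self.scores[-1] - self.scores[0]
            if drift > 0.3:       # strain rising across the window
                return "slow_down"
            if drift < -0.3:      # moving toward reflective capacity
                return "invite_reflection"
            return "hold_steady"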

Safety modes (state-dependent behaviour)

Rather than producing a single "safe" response, the system shifts between modes. Mode changes are explained to the user for transparency and trust, and the user can always choose to pause.

  Mode           Trigger                              Behaviour
  Normal         Stable language                      Open dialogue, gentle reflection.
  Containment    Elevated arousal                     Grounding, reduced pace, reduced cognitive load.
  Support-offer  Persistent distress or moral strain  Explicit offer of human help, plus bounded reflective support where appropriate.
  Crisis         Imminent risk language               Crisis script, signposting, clear limits, no ambiguity.
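
A sketch of the mode machine the table implies. The trigger names and the de-escalation rule are assumptions; in practice each transition would also be explained to the user.

    def next_mode(current: str, signals: set) -> str:
        """Pick the next mode from recognised signals; crisis always wins."""
        if "imminent_risk" in signals:
            return "crisis"  # no ambiguity, no competing considerations
        if "persistent_distress" in signals or "moral_strain" in signals:
            return "support_offer"
        if "elevated_arousal" in signals:
            return "containment"
        if current != "normal" and not signals:
            return current  # hold the current mode on a single calm turn
        return "normal"

Holding the current mode on one calm turn keeps de-escalation gradual, which matches the containment-first stance above.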

Different from other methods, but usable alongside them

This differs from more conventional methods because it uses the dialogue itself as the monitoring surface. Instead of relying only on scheduled checklists, one-off triage, or fixed scripts, the chatbot can adapt turn by turn as the language changes.

That does not mean it replaces existing approaches. It can work in conjunction with TRiM-style check-ins, clinician review, formal screening tools, or partner-led escalation pathways.

  • Alongside structured check-ins, it offers a more continuous reading of how the conversation is moving.
  • Alongside human care, it can scaffold, prepare, and signpost rather than replace judgment.
  • Alongside reflective methods, it may help identify when reflection is appropriate and when regulation needs to come first.

What the system will not do

Safety is also about restraint. The assistant will not push for trauma narratives or exposure work, will not make promises, and will not present itself as the only support available. Ending safely is treated as success, not failure.

  • No diagnosis, no clinical certainty, no false reassurance or guarantees.
  • No moralising or debate around guilt, responsibility, or blame.
  • No continuing a session if language indicates imminent harm.

Human-in-the-loop escalation (practical boundaries)

Human oversight is situational, not constant. When thresholds are crossed, the system routes to a human reviewer with minimal, relevant context.

Shared

  • Minimal excerpt (closest relevant turns)
  • Escalation reason and mode
  • Timestamp and language pack

Not shared

  • Full conversation history by default
  • Speculative profiling or diagnoses
  • Any non-essential personal data
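
A minimal handover payload consistent with this split might look like the following; the field names are illustrative.

    from datetime import datetime, timezone

    def handover_payload(turns: list, reason: str, mode: str,
                         language_pack: str) -> dict:
        """Build the minimal context a reviewer receives; nothing else is shared."""
        return {
            "excerpt": turns[-3:],  # closest relevant turns only, never full history
            "reason": reason,       # escalation reason
            "mode": mode,           # mode at the moment of escalation
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "language_pack": language_pack,
        }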

Locale-specific crisis pathways

Crisis responses are language- and region-sensitive. The assistant does not assume location unless the user has chosen a language pack. It offers options, not orders.

"I'm concerned about your safety right now. I can't help with this alone. If you're in the UK, Combat Stress is available 24/7. If you're elsewhere, I can help you find local support."

For demo pages, it is fine to keep helplines as placeholders until you finalise verified partners.
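
A sketch of locale-sensitive signposting along those lines, keeping helplines as placeholders apart from the Combat Stress example above.

    CRISIS_PATHWAYS = {
        # Only populated for language packs with verified partners.
        "en-GB": "If you're in the UK, Combat Stress is available 24/7.",
    }
    FALLBACK = "I can help you find local support."  # no location assumed

    def crisis_signpost(language_pack=None) -> str:
        """Offer locale-appropriate options without assuming location."""
        option = CRISIS_PATHWAYS.get(language_pack, FALLBACK)
        return ("I'm concerned about your safety right now. "
                "I can't help with this alone. " + option)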

Audit without extraction

Logs exist to review system behaviour, not to label users. This supports minimisation, proportionality, and ethical accountability.

Logged

  • Mode changes (e.g. Normal -> Containment -> Crisis)
  • Crisis script activation
  • Handover prompts and outcomes

Not logged

  • Diagnostic labels
  • Speculative emotional tagging
  • Long-term behavioural profiling
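
A sketch of an audit record that honours this split; the event names are illustrative.

    import json
    from datetime import datetime, timezone

    ALLOWED_EVENTS = {"mode_change", "crisis_script", "handover"}

    def audit_event(event_type: str, detail: str) -> str:
        """Record system behaviour only; never labels, tags, or profiles."""
        assert event_type in ALLOWED_EVENTS, "only system behaviour is auditable"
        return json.dumps({
            "event": event_type,  # e.g. "mode_change"
            "detail": detail,     # e.g. "normal -> containment"
            "at": datetime.now(timezone.utc).isoformat(),
        })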

Why this matters

Safety-first here is not an attempt to prevent all harm, which would be impossible. It is an attempt to reduce avoidable harm, preserve dignity under strain, and ensure automated presence never exceeds its ethical authority.