#6 Knowledge-Guided Retrieval-Augmented Generation for Zero-Shot Psychiatric Data: Privacy Preserving Synthetic Data Generation

Score: 24.3 | Matched keywords: ai, deep learning, large language models, llm, retrieval-augmented

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles privacy-preserving synthetic data generation for psychiatry. The authors propose a zero-shot, knowledge-guided framework that uses a Large Language Model (LLM) to “simulate” a patient completing a structured clinical assessment based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [12]. By employing Retrieval-Augmented Generation (RAG) [13], the model grounds its responses in a knowledge base of clinical criteria, so it can generate clinically plausible and coherent assessment data without being trained on or exposed to real patient records. The authors present this as a fundamentally different paradigm for psychiatric data synthesis, one that circumvents the data-dependency problem and its associated fidelity-utility-privacy paradox.

The core proposal is a zero-shot generation approach for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. The authors ran experiments with different combinations of knowledge bases to generate privacy-preserving synthetic data.

The empirical case is built around the comparison with CTGAN and TVAE, both of which are trained on real data. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.

The central reported finding is that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM, as shown by the ablation study. On the privacy side, the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.

The paper also makes clear that Differential Privacy (DP) has become the gold standard for mitigating the privacy risks associated with training generative models on sensitive data [9]. Existing approaches rely on models that have been pre-trained or fine-tuned on vast amounts of clinical data, which carries an implicit risk of memorization and patient information leakage [26]. This reliance makes them vulnerable to privacy risks and perpetuates data scarcity issues when no initial dataset is available. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM.
Most important supporting result: Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.
Important caution: Differential Privacy (DP) has become the gold standard for mitigating the privacy risks associated with training generative models on sensitive data [9].

Problem definition

To generate privacy-preserving synthetic data, we propose a zero-shot, knowledge-guided framework that leverages a Large Language Model (LLM) to “simulate” a patient completing a structured clinical assessment based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [12].
By employing Retrieval-Augmented Generation (RAG) [13], our model grounds its responses in a knowledge base of clinical criteria, enabling it to generate clinically plausible and coherent assessment data without being trained on or exposed to real patient records.
In this work, we introduce a fundamentally different paradigm for psychiatric data synthesis that completely circumvents the data-dependency problem and its associated fidelity-utility-privacy paradox.
To protect patient confidentiality, techniques like Differential Privacy (DP), where noise is injected during the learning process, are often applied during training of the generative model [9].
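
To make "noise is injected during the learning process" concrete, the following is a minimal sketch of one differentially private gradient step in the style of DP-SGD: each per-example gradient is clipped, the clipped gradients are summed, and Gaussian noise scaled to the clipping bound is added. The function names, the toy gradients, and the parameter values are illustrative assumptions; nothing here is taken from the paper or from [9].

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise scale is tied to the clipping bound (hypothetical parameter names).
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


# Toy per-example gradients, invented for illustration only.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
print(noisy_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single record can move the update, which is what lets the added noise mask an individual's contribution. The zero-shot approach summarized here avoids this step entirely because no real records are used for training.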

Core idea & method

The paper proposes a zero-shot, knowledge-guided approach for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10).
The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks.
We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data.
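
The retrieval-then-prompt pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the knowledge-base passages are placeholders (no real DSM-5 or ICD-10 text), retrieval is a naive token-overlap ranking rather than whatever retriever the paper uses, and `retrieve`, `build_prompt`, and the yes/no item format are hypothetical. The LLM call itself is deliberately left out.

```python
import re

# Placeholder passages; a real knowledge base would hold clinical criteria text.
KNOWLEDGE_BASE = [
    {"source": "DSM-5", "text": "placeholder passage about social anxiety criteria"},
    {"source": "ICD-10", "text": "placeholder passage about separation anxiety criteria"},
    {"source": "DSM-5", "text": "placeholder passage about an unrelated topic"},
]


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, knowledge_base, k=2):
    """Rank passages by token overlap with the query and return the top k."""
    q = tokenize(query)
    ranked = sorted(
        knowledge_base,
        key=lambda p: len(q & tokenize(p["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(disorder, item, passages):
    """Assemble a prompt that grounds a simulated patient in retrieved text."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return (
        f"You are simulating a patient with {disorder}.\n"
        f"Clinical reference material:\n{context}\n"
        f"Assessment item: {item}\n"
        "Answer with exactly one of: yes, no."
    )


passages = retrieve("social anxiety", KNOWLEDGE_BASE, k=2)
print(build_prompt("social anxiety", "hypothetical item text", passages))
```

Swapping which sources populate `KNOWLEDGE_BASE` is one way to picture the "different combinations of knowledge bases" mentioned above, and passing an empty passage list corresponds to a no-retrieval baseline.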

Actual findings

An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM.
Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.
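
One simple reading of "overlap" and "duplication" is exact row matching, sketched below. The digest does not state how the paper computes its privacy metrics, so the two functions and the toy rows are assumptions for illustration; linkage risk and the k-map score are not covered here.

```python
from collections import Counter


def overlap_rate(synthetic, real):
    """Fraction of synthetic rows that exactly match some real row."""
    real_set = set(real)
    return sum(row in real_set for row in synthetic) / len(synthetic)


def duplicate_rate(rows):
    """Fraction of rows that repeat an earlier row in the same table."""
    counts = Counter(rows)
    return sum(c - 1 for c in counts.values()) / len(rows)


# Toy binary assessment rows, invented for illustration only.
real = [(1, 0, 1), (0, 0, 1), (1, 1, 0)]
synthetic = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]

print(overlap_rate(synthetic, real))  # 0.5: two of four synthetic rows match a real row
print(duplicate_rate(synthetic))      # 0.25: one of four synthetic rows is a repeat
```

Note that a generator never exposed to real data can still produce exact matches by chance when the table has few columns with few possible values, which is consistent with the "modest overlaps" reported for the LLM.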

How the conclusion was reached

Step 1 - Proposed approach: A zero-shot, knowledge-guided approach for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10).
Step 2 - Evaluation setup or comparison basis: The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks.
Step 3 - Main reported evidence: An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM.
Step 4 - Additional supporting or qualifying result: Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.
Step 5 - Claim boundary / limitation: Differential Privacy (DP) has become the gold standard for mitigating the privacy risks associated with training generative models on sensitive data [9].

Experimental setup & results

An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM.
CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety.
Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.
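
The distinction between univariate (marginal) and pairwise fidelity can be illustrated with a small sketch. It assumes total variation distance over single columns and over column pairs as the error measure; the digest does not say which distance the paper uses, so `tvd`, `univariate_error`, `pairwise_error`, and the toy tables are hypothetical.

```python
from collections import Counter
from itertools import combinations


def tvd(a, b):
    """Total variation distance between two empirical distributions."""
    pa, pb = Counter(a), Counter(b)
    keys = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[k] / len(a) - pb[k] / len(b)) for k in keys)


def univariate_error(real, synth):
    """Mean TVD over single columns (marginals)."""
    n_cols = len(real[0])
    return sum(
        tvd([r[j] for r in real], [s[j] for s in synth]) for j in range(n_cols)
    ) / n_cols


def pairwise_error(real, synth):
    """Mean TVD over the joint distribution of every column pair."""
    pairs = list(combinations(range(len(real[0])), 2))
    return sum(
        tvd([(r[i], r[j]) for r in real], [(s[i], s[j]) for s in synth])
        for i, j in pairs
    ) / len(pairs)


# Toy two-column tables, invented for illustration only.
real = [(1, 0), (1, 1), (0, 0), (0, 1)]
synth = [(1, 1), (1, 1), (0, 0), (0, 0)]

print(univariate_error(real, synth))  # 0.0: each column's marginal matches
print(pairwise_error(real, synth))    # 0.5: the joint structure does not
```

The toy tables show why the two measures are reported separately: a generator can match every marginal and still miss how items co-occur, which is the kind of gap that retrieval over clinical criteria is reported to narrow.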

Limitations & risks

To mitigate the privacy risks associated with training generative models on sensitive data, Differential Privacy (DP) has become the gold standard [9].
Existing approaches rely on models that have been pre-trained or fine-tuned on vast amounts of clinical data, which carries an implicit risk of memorization and patient information leakage [26].
This reliance makes them vulnerable to privacy risks and perpetuates data scarcity issues when no initial dataset is available.
Our work diverges from this entire paradigm by eliminating the need for a real training dataset altogether.

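Detailed Summary (KO)

Read-like-fullpaper digest

To generate privacy-preserving synthetic data, this paper proposes a zero-shot, knowledge-guided framework that uses a Large Language Model (LLM) to "simulate" a patient completing a structured clinical assessment based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [12]. By employing Retrieval-Augmented Generation (RAG) [13], the model grounds its responses in a knowledge base of clinical criteria, so it can generate clinically plausible and coherent assessment data without being trained on or exposed to real patient records. The authors present this as a fundamentally different paradigm for psychiatric data synthesis, one that circumvents the data-dependency problem and its associated fidelity-utility-privacy paradox.

The core proposal is a zero-shot generation approach for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10). The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks. The authors ran experiments with different combinations of knowledge bases to generate privacy-preserving synthetic data.

The empirical case is built around the comparison with CTGAN and TVAE. An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM. CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety. Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.

The paper also makes clear that Differential Privacy (DP) has become the gold standard for mitigating the privacy risks associated with training generative models on sensitive data [9]. Existing approaches rely on models that have been pre-trained or fine-tuned on vast amounts of clinical data, which carries an implicit risk of memorization and patient information leakage [26]. This reliance makes them vulnerable to privacy risks and perpetuates data scarcity issues when no initial dataset is available. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.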

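Final takeaway

Main takeaway: An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM.
Most important supporting result: Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.
Important caution: Differential Privacy (DP) has become the gold standard for mitigating the privacy risks associated with training generative models on sensitive data [9].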

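Problem definition

To generate privacy-preserving synthetic data, we propose a zero-shot, knowledge-guided framework that leverages a Large Language Model (LLM) to "simulate" a patient completing a structured clinical assessment based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [12].
By employing Retrieval-Augmented Generation (RAG) [13], our model grounds its responses in a knowledge base of clinical criteria, enabling it to generate clinically plausible and coherent assessment data without being trained on or exposed to real patient records.
In this work, we introduce a fundamentally different paradigm for psychiatric data synthesis that completely circumvents the data-dependency problem and its associated fidelity-utility-privacy paradox.
To protect patient confidentiality, techniques like Differential Privacy (DP), where noise is injected during the learning process, are often applied during training of the generative model [9].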

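Core idea & method

The paper proposes a zero-shot, knowledge-guided approach for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10).
The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks.
We conducted experiments using different combinations of knowledge bases to generate privacy-preserving synthetic data.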

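Actual findings

An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM.
Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.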

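How the conclusion was reached

Step 1 - Proposed approach: A zero-shot, knowledge-guided approach for psychiatric tabular data in which large language models (LLMs) are steered via Retrieval-Augmented Generation using the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) and the International Classification of Diseases (ICD-10).
Step 2 - Evaluation setup or comparison basis: The resulting models were benchmarked against two state-of-the-art deep learning models for synthetic tabular data generation, namely CTGAN and TVAE, both of which rely on real data and therefore entail potential privacy risks.
Step 3 - Main reported evidence: An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM.
Step 4 - Additional supporting or qualifying result: Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.
Step 5 - Claim boundary / limitation: Differential Privacy (DP) has become the gold standard for mitigating the privacy risks associated with training generative models on sensitive data [9].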

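Experimental setup & results

An ablation study shows that clinical retrieval reliably improves univariate and pairwise fidelity over a no-retrieval LLM.
CTGAN typically achieves the best marginals and multivariate structure, while the knowledge-augmented LLM is competitive on pairwise structure and attains the lowest pairwise error in separation anxiety and social anxiety.
Privacy analyses indicate that the real-data-free LLM yields modest overlaps and a low average linkage risk comparable to CTGAN, whereas TVAE exhibits extensive duplication despite a low k-map score.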

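Limitations & risks

To mitigate the privacy risks associated with training generative models on sensitive data, Differential Privacy (DP) has become the gold standard [9].
Existing approaches rely on models that have been pre-trained or fine-tuned on vast amounts of clinical data, which carries an implicit risk of memorization and patient information leakage [26].
This reliance makes them vulnerable to privacy risks and perpetuates data scarcity issues when no initial dataset is available.
Our work diverges from this entire paradigm by eliminating the need for a real training dataset altogether.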