1.Abstract
This case study details the pipeline of extracting, structuring, and validating public sentiment data surrounding privacy and safety concerns related to AI-powered wearable devices. Discussions around Meta Ray-Ban smart glasses intensified in early 2026 as users, researchers, and online communities debated issues such as surveillance, consent, data collection, and transparency in AI training practices.
To analyze this broader conversation, the project integrated Vivly to autonomously identify signals and Aquin to rigorously inspect the dataset. While the retrieval pipeline surfaced a much larger corpus of wearable AI and privacy-related discussions, this case study specifically focuses on conversations related to Meta Ray-Ban smart glasses. The final result is a curated 1,500-entry training dataset sourced from Reddit and Hacker News that captures public discussions around consumer trust, privacy expectations, passive recording concerns, and AI-enabled wearable technology.
2.Background
The dataset was collected during a period of heightened public discussion around privacy practices associated with Meta Ray-Ban glasses. Conversations across online communities focused on topics such as passive recording, AI training transparency, third-party data handling, and how wearable AI devices may affect expectations of privacy in public and personal spaces.
One incident that intensified these discussions involved investigative reports alleging that contractors associated with AI data labeling operations reviewed user-generated video and audio clips captured through the glasses. The reports raised broader concerns around informed consent, transparency in AI training workflows, data handling practices, and the privacy implications of wearable AI systems.
Following public backlash and regulatory scrutiny, Meta reportedly paused and later ended parts of its collaboration with Sama, a Kenya-based data annotation company involved in the workflow. The controversy also contributed to public debate around AI governance, consumer trust, and responsible deployment of always-on wearable technologies.
3.Architecture
This project utilizes two primary platforms to process the unstructured data into a secure training set.
1. Data Acquisition and Structuring: Vivly
Vivly is a signal identification platform for public and social data. It surfaces meaningful signals from large scale discussions, helping enterprises understand exactly what is being discussed, by whom, and why it matters.
The project used the Vivly SDK, available via pip and npm, to fetch relevant discussions around the Meta Ray-Ban privacy controversy.
2. Dataset Validation and Compliance: Aquin
Aquin is a platform dedicated to building, inspecting, and improving artificial intelligence models, especially large language models. It focuses on peering into how models work internally to ensure they are reliable, safe, and accurate before deployment.
Because the dataset contained raw internet reactions to a highly sensitive privacy controversy, it required thorough sanitization before being used for training and analysis. For this, we used Aquin's Dataset Inspector, which ingests raw social data and processes it through a safety and compliance framework designed for artificial intelligence datasets. The platform performed the following critical checks:
4.Data Preparation
4.1 Data Sources
To capture authentic public reactions to Meta Ray-Ban smart glasses, we sourced data from two primary platforms: Reddit and Hacker News. These sites host some of the most unfiltered debates on emerging tech and privacy.
4.2 Data Extraction
We used the Vivly CLI to automate the extraction.
$ vivly route \ "Meta Ray-Ban wearable AI privacy discussions and recording concerns" \ --reddit --hackernews \ --items=1500 --format=jsonl
We queried the Vivly SDK with the above prompt, and it analysed the intent and identified the specific communities actively discussing it.
Using the Vivly SDK, the pipeline initially surfaced several thousand public discussions across Reddit and Hacker News related to wearable AI, privacy expectations, surveillance concerns, and Meta Ray-Ban smart glasses. After multiple filtering and validation stages designed to remove noise, duplicates, and low-relevance entries, the dataset was refined into a curated corpus of 1,500 high-signal discussions for downstream analysis and experimentation.
4.3 Result
{
"id": "1stq3ct",
"url": "https://www.reddit.com/r/privacy/comments/1stq3ct/...",
"score": 479,
"title": "Being recorded with meta glasses during work",
"content": "Today I was doing my job at a restaurant...",
"subreddit": "privacy",
"created_date": "2026-04-23T17:51:33+00:00",
"num_comments": 227,
"comments": [
{
"id": "ohv9lnq",
"body": "Mention it to bosses as it has to be addressed in some standard
yet inoffensive way for staff - that you can politely decline to
be recorded more than a couple of seconds, say.",
"score": 351,
"depth": 0,
"created_utc": 1776968911.0
}
]
}5.Data Processing
5.1 Dataset Preparation for Aquin
The raw JSON data extracted from Reddit and Hacker News was deeply valuable but far too unstructured for direct model training
The key step here was using Claude Sonnet 4.6 not to generate content, but to restructure it.
The model analyzed the raw data and logically grouped scattered discussions based on shared article links and core topics. This preserved the contextual richness of the human conversations while organizing them into coherent, unified threads.
This consolidation step compressed approximately 1,500 individual discussions into 296 structured conversational rows while preserving the core semantic context and sentiment patterns present across the source material. The resulting structure was significantly more efficient for downstream inspection, clustering, and compliance analysis.
Once the discussions were logically grouped, the data was passed through a lightweight formatting script. This step required no additional AI processing. The script converted the grouped data into a strict, LLaMA-compatible prompt-and-answer format. The output was a clean JSON Lines (JSONL) file structured to match the ingestion requirements of Aquin's Dataset Inspector.
Finally, the formatted JSONL file was uploaded into Aquin, where the Dataset Inspector automatically processed the entries through its predefined evaluation pipelines.
6.Process Views
A selection of views from the dataset inspector, audit surfaces, and pipeline output across each stage of the project.
6.1 Prompt Injection Scan
CleanThis process scans the dataset's training rows to detect embedded prompt injection patterns. It specifically looks for inputs designed to hijack the AI by overriding its primary instructions.
The dataset was analyzed (296 rows) and returned a completely "Clean" verdict. Zero rows were flagged, and the average injection score was an incredibly low 0.0037, meaning the data is secure from basic injection attacks.
Flagged
0%
0 rows
High Conf
0
≥ 0.88
Verdict
Clean
No prompt injection patterns detected
6.2 Opt-Out and Consent Registry
ClearThis step checks any web links (URL columns) present in the dataset against the Spawning AI opt-out registry and standard robots.txt restrictions. This ensures the data respects creator consent and legal scraping boundaries.
The status is entirely "Clear." The scanner detected zero URL columns in this specific dataset, meaning no domains were blocked and no further opt-out compliance checks were required.
Status
Clear
Opted-Out URLs
0
Domains Blocked
0
URLs Checked
0
Domains Checked
0
Clear URLs
0
No URL columns detected.
6.3 Bias Surface and Fairness Analysis
Low RiskThis analysis detects protected attributes, such as gender, race, or age, and measures label imbalances. The goal is to ensure the dataset is fair, balanced, and won't train the AI to exhibit discriminatory behavior.
The bias risk was marked as "Low." The system detected zero protected attributes and zero label columns across the 296 rows, concluding that there are no significant bias signals or fairness concerns.
Protected Attrs
0
detected
Label Columns
0
analysed
Bias Risk
Low
Summary Flags
No protected attributes or label columns detected
6.4 Toxicity Analysis
CleanThis scan evaluates the dataset for harmful, offensive, or inappropriate language. It identifies toxic rows, provides a severity breakdown, and pins the worst offenders for manual review.
The overall verdict is "Clean." While 4.7% of the data (14 rows) was flagged for minor toxicity, only 1 single row was classified as "severe" (scoring ≥ 0.8). The vast majority of the sample remains safe.
Flagged
4.7%
14 rows
Severe
1
≥ 0.8
Overall
CLEAN
Toxicity Distribution
6.5 System Prompt Leak & Role Confusion Check
CleanBased on the secondary prompt injection scan you uploaded, this specific check digs deeper into adversarial attacks that attempt to cause "role confusion" or trick the AI into leaking its confidential backend system prompts.
Just like the primary injection scan, this deep dive came back "Clean." The system confirmed a 0% flag rate for these advanced manipulation tactics in both the user and assistant columns.
Flagged
0%
0 rows
High Conf
0
≥ 0.88
Verdict
Clean
No system prompt leaks or role confusion patterns detected
6.6 Synthetic Content Detection
HumanThis process analyzes the text to determine if it was generated by an AI (synthetic) rather than written by a human. It scores the likelihood of AI origin and pinpoints the exact rows that look machine-generated.
The overall dataset is classified as "Human," with a very low average synthetic score of 0.1432. However, it did flag 0.7% of the data (2 rows) as highly synthetic: Row #178 (assistant) hit 100% synthetic confidence, and Row #138 (user) hit 90% confidence.
Synthetic
0.7%
2 rows
High Conf
2
≥ 0.9
Verdict
HUMAN
Score Distribution
Per-Column Breakdown
0.3% flagged · 1 high conf
0.3% flagged · 1 high conf
Flagged Rows (≥ 0.7 score)
6.7 Poisoned Sample Detection
CleanThis process analyzes the dataset to detect "poisoned" training samples, maliciously altered data meant to corrupt the AI's learning, by searching for cluster outliers, label inconsistencies, and loss anomaly signals.
Across the 296 rows analyzed, the dataset performed perfectly with a 0% flagged rate and a 0 high-confidence score, resulting in a completely "Clean" verdict. The average anomaly score remained extremely low at 0.1423.
A deeper look into the signal analysis confirmed that no significant cluster outliers were detected, no label inconsistencies were found, and zero loss anomalies were present. The dataset is currently free from any poisoned sample vulnerabilities.
Flagged
0%
0 rows
High Conf
0
≥ 0.8
Verdict
Clean
No poisoned samples detected
6.8 Copyright and License Risk Assessment
ElevatedThis analysis evaluates the dataset for potential intellectual property violations by calculating a composite IP score based on domain analysis, inline license signals, and copyrighted content markers.
Unlike the security scans, this scan flagged an "Elevated" overall risk, issuing a composite score of 46 out of 100. This elevated status is driven entirely by the lack of a declared license. Because no license is attached to the data, the system automatically assumes a "restricted" status, generating a high license risk score of 75.
Fortunately, the actual content analysis poses a very low risk (score: 0.0125). Across a 200-row sample, the system found 0% copyright notices, 0% open license references, and 0% book/publication markers. The only minor flag was that 2.5% of the rows contained "news wire phrases," but no direct copyrighted content was identified.
Overall Risk
ElevatedLicense
No license declared
Assumed restricted
Content Signals: 200 Rows
Copyright Notices
0%
Open License Refs
0%
News Wire Phrases
2.5%
Book / Pub Markers
0%
6.9 Privacy and PII Scan
MappedThis scan identifies Personally Identifiable Information (PII) across the dataset (names, contacts, locations) and reports which columns carry the highest exposure so teams know exactly what to address before production.
The scan surfaced 106 entities across 90 rows (30.4% of the dataset). The breakdown is exactly what you would expect from a global privacy story: 105 of those entities are nationality and religion mentions, the kind of contextual detail that makes social data valuable for understanding real public sentiment. The one concrete action item is a single phone number that appeared in a user comment, which is straightforward to redact.
The exposure is concentrated in the user column (29.7% of rows, 104 entities), while the assistant column is nearly clean (0.7% of rows, 2 entities). This distribution is typical for raw forum data. The scan has done its job: the team now knows exactly which rows to touch and which to leave alone.
PII Rows
30.4%
90 of 296
Entities
106
detected
Risk
HIGH
By Category
Entity Breakdown
PII Density Per Column
29.7% rows affected · 104 entities
0.7% rows affected · 2 entities
6.10 Text Quality and Duplication Analysis
Low DuplicationThis check evaluates the foundational quality of the dataset's text by analyzing the language distribution and scanning for exact or near-duplicate rows that could skew the AI's training.
The dataset showed exceptional text hygiene in this assessment. The language distribution is 100% English, meaning there are no mixed-language translation anomalies to account for.
Furthermore, the duplicate detection process (using a 0.85 Jaccard similarity threshold) confirmed that 100% of the 296 rows are clean. The system found 0% near-duplicates and 0 exact identical rows, resulting in a "Low Duplication" status.
Language Distribution
Clean Rows
100%
296 rows
Near-Dupes
0%
0 rows
Exact Dupes
0
Identical
6.11 Compliance Audit Trail Flags
Audit CompleteThe audit trail grades the dataset against established regulatory frameworks and produces a prioritized action list, so teams know exactly what to resolve before the dataset enters a training pipeline.
The audit assessed 5 clauses and returned a clear, prioritized picture. The 4 flagged items all trace back to the same root cause: the nationality and religion mentions identified in the PII scan. These are expected in any dataset built from a global privacy controversy, and now they are precisely mapped, which is exactly the output you need before production.
Critically, the dataset passed Section 9 (Sensitive Personal Data) outright, confirming the absence of financial records, health data, Aadhaar, and PAN numbers. The hard categories are clean. What remains is a single well-scoped remediation: address the nationality mentions and the one phone number, and the dataset clears the remaining flags.
1 dim · 5 clauses assessed
Flagged: all trace to PII (4)
Passed: sensitive categories clean (1)
6.12 Framework Scores and Remediation Plan
Roadmap ReadyThe final step translates the audit findings into framework scores and a concrete remediation roadmap, so the team leaves with a clear path to a production-ready dataset, not just a list of issues.
These are pre-remediation baseline scores for a raw social dataset. These are the expected starting point before a compliance pass, not a measure of the data's usefulness. The India DPDPA score of 62% reflects that the hardest compliance requirements (no financial, health, or biometric data) are already met. The EU AI Act and NIST AI RMF scores track directly to the nationality mentions and the single phone number, both of which are well-understood and fixable in one pass.
The pipeline produced this full compliance picture, mapped, scored, and prioritized, automatically. A team running this without Vivly and Aquin would have reached the same point after weeks of manual review. The remediation plan itself is three steps, all scoped, none ambiguous.
Pre-Remediation Baseline Scores
raw social data · before compliance passPII remediation scoped
Partial, hard rules met
PII remediation scoped
Three-step path to production
7.Conclusion
The challenge with social data has rarely been access. The real difficulty is turning noisy, inconsistent public discussions into datasets that are structured enough for downstream analysis and model development. Public forums contain sarcasm, reposts, fragmented context, and low-signal commentary that make reliable dataset construction difficult at scale.
This project explored a different approach. Using the Vivly SDK, the pipeline identified relevant communities discussing wearable AI privacy concerns and surfaced high-signal discussions related to Meta Ray-Ban smart glasses. After filtering and refinement, the dataset was narrowed into a curated set of 1,500 discussion entries aligned with the core themes of the case study.
The structured output was then passed through a Claude-assisted organization stage followed by Aquin's inspection pipeline. Across multiple automated inspection layers, the dataset showed strong structural consistency with minimal indicators of adversarial manipulation, synthetic amplification, or poisoned content.
Because the dataset was collected directly from public online forums, certain discussions still contained personally identifiable information and sensitive user-provided details originating from the source platforms themselves. Before any downstream training or experimentation, the next stage of the pipeline will focus on PII scrubbing, remediation, and compliance alignment to ensure the dataset is safer and more suitable for research use.
