An AI can take 1,000 public forum comments and return a detailed professional profile — employer, personality, political leanings, sometimes a real name. Here's what that means for litigation teams, fraud investigators, and anyone with clients who post on the internet.
Founder of Grayhaven
A few weeks ago, Simon Willison published a post about a small experiment he ran. He pulled 1,000 public comments from any Hacker News user via the Algolia API — totally free, no authentication required — fed them into Claude Opus, and asked for a profile. What came back was detailed enough to feel unsettling: professional identity, working style, personality traits, technical interests, likely employer. In some cases it inferred the person's real name from links they'd posted to personal projects.
All of that information was public. Every comment was public. The API is public. The model is commercial off-the-shelf.
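To make the mechanics concrete, here is a minimal sketch of that kind of fetch against the public Algolia Hacker News API. The endpoint and `comment,author_<username>` tag filter are real and require no authentication; the profiling prompt wording is my own illustration, not Willison's actual prompt, and the LLM call itself is omitted.

```python
import json
import urllib.request

ALGOLIA_SEARCH = "https://hn.algolia.com/api/v1/search"

def comments_url(username: str, hits_per_page: int = 1000) -> str:
    """Build the Algolia HN API URL for one user's public comments.
    No API key or authentication is required for this endpoint."""
    return (f"{ALGOLIA_SEARCH}?tags=comment,author_{username}"
            f"&hitsPerPage={hits_per_page}")

def fetch_comments(username: str) -> list[str]:
    """Fetch the comment bodies for one user from the public API."""
    with urllib.request.urlopen(comments_url(username)) as resp:
        hits = json.load(resp)["hits"]
    return [h["comment_text"] for h in hits if h.get("comment_text")]

def build_profile_prompt(comments: list[str]) -> str:
    """Assemble a profiling prompt from raw comment text.
    Illustrative wording only -- not Willison's exact prompt."""
    joined = "\n---\n".join(comments)
    return ("Based on the following public forum comments, describe "
            "this person's likely profession, interests, employer, "
            "and working style:\n\n" + joined)
```

That's the entire barrier to entry: one unauthenticated HTTP request and a paragraph of prompt text.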
Willison described it as "startlingly effective" and acknowledged it felt "a little creepy." I think that's exactly the right reaction. And I think most lawyers and insurance professionals haven't sat with it long enough.
Here's the thing about public data: any single piece of it is basically harmless. Someone comments on a forum about a software library they use at work. Someone else posts about a frustrating deposition experience. A third person mentions their city and their specialty in the same thread. None of it is sensitive on its own.
The problem is aggregation. A human researcher combing through a thousand scattered posts would take days and still miss patterns. An LLM with access to those same posts can synthesize them in minutes. The individual pieces don't change. What changes is the cost and speed of assembly.
This isn't a new concept — data brokers have been aggregating public records for decades. I work at LexisNexis Risk Solutions, which is literally in the business of combining public data to assess risk. The difference now is that LLMs can do the same thing with unstructured data — forum posts, comment threads, published articles, social media — not just clean structured records. That's a qualitatively different capability.
Think about the discovery and prep workflow for a complex piece of litigation. You have opposing counsel. You have witnesses. You have expert witnesses who get paid to testify on certain topics and have done so dozens of times before. And, depending on jurisdiction and applicable rules, you may have jurors.
All of these people leave digital footprints. Published articles, conference talks, LinkedIn posts, Reddit comments, Twitter threads, quotes in news stories. Expert witnesses especially — they often have years of publicly documented opinions on contested topics.
An AI-powered OSINT workflow can surface:

- An expert witness's years of published opinions, cross-referenced against what they're being paid to say in the current matter
- Patterns and inconsistencies across opposing counsel's public commentary on the issues in dispute
- Connections across a witness's scattered public posts — employer, location, affiliations — that were never meant to be read together
I'm not suggesting anything improper here. Most jurisdictions have established rules about juror research, and ethical obligations govern how intelligence can be used. The point is that this capability exists, the tooling is cheap, and if your opponents have access to the same models you do, they can run the same playbook on your witnesses and your clients.
Understanding what AI can surface about someone is no longer optional prep.
Fraud investigators have always worked with public data — social media posts of claimants supposedly bedridden but photographed hiking, that kind of thing. The challenge has always been the manual effort. You can only look at so many profiles before the work becomes unsustainable.
LLM-assisted OSINT changes the economics. A claims file that would have warranted maybe two hours of manual social media review can now get a systematic sweep of every accessible public post, sorted and summarized by relevance to the claimed injury and timeline. The model can flag contradictions, identify behavioral patterns, and surface lifestyle indicators that don't align with the claimed disability.
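As a toy illustration of that triage step, here's a sketch that flags posts dated inside a claimed disability window that mention physical activity. Everything here is invented for illustration — the keyword list, the data shape, the function names — and a production workflow would use an LLM for the relevance judgment rather than keyword matching. The output is a lead list for a human investigator, nothing more.

```python
from datetime import date

# Hypothetical activity keywords that might conflict with a claimed
# mobility-limiting injury; a real workflow would use an LLM to judge
# relevance, not a keyword list.
ACTIVITY_FLAGS = ("hiking", "marathon", "skiing", "lifting", "gym")

def flag_posts(posts, claim_start: date, claim_end: date):
    """Return (date, text) posts inside the claim window that mention
    physical activity -- leads for human review, not evidence."""
    leads = []
    for posted_on, text in posts:
        if claim_start <= posted_on <= claim_end:
            lowered = text.lower()
            if any(keyword in lowered for keyword in ACTIVITY_FLAGS):
                leads.append((posted_on, text))
    return leads
```

The design choice that matters is in the docstring: the function produces candidates for a person to evaluate, which keeps the human decision point explicit in the workflow.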
This is already happening. The question for insurers isn't whether to use these tools — it's how to build workflows around them that are defensible, documented, and compliant with applicable regulations on claims investigation. Using AI to identify leads for human investigators is very different from making coverage decisions algorithmically based on a profile.
The distinction matters. Build the workflow with that distinction explicit from the start.
Everything I've described above runs in both directions.
Your clients post publicly. Your named partners post publicly. Your expert witnesses post publicly. What does a systematic AI aggregation of their digital footprint look like to opposing counsel? To a regulatory investigator? To a journalist working on a story?
This is a conversation most firms haven't had with clients, and probably should. Not to create anxiety, but because the practical implications are real:

- Anything a client posts during an active matter can be aggregated and read against their sworn statements
- Years-old comments can be resurfaced and synthesized into a narrative the author never intended
- A firm's own public commentary can be mined for positions that sit awkwardly next to arguments it's making in court
None of this requires bad intent from anyone. It's just what happens when you have years of public statements and someone builds an efficient tool to aggregate them.
I'm not going to tell you to lock down every social media account and stop existing on the internet. That's not realistic, and for most professionals the disclosure risk is lower than the business-development value of being present online.
What I would suggest:
For litigation teams: build AI-powered OSINT into your witness prep and opposing-side research as a standard step, not an ad hoc thing you do when someone remembers to ask. This doesn't require a six-month technology initiative. It requires a workflow, a model with web access or a document ingestion pipeline, and someone accountable for running it.
For insurance carriers: treat AI-assisted public data review as a tool for generating investigator leads, not for making coverage decisions. Document the workflow. Make sure your investigation standards explicitly address AI-sourced intelligence so you're not retrofitting policy when a case goes sideways.
For both: brief your clients. A simple advisory on digital footprint management — what to avoid posting during active matters, why consistency between public statements and sworn testimony matters — is something most clients haven't received and would genuinely value.
On the ethics: the capability being available doesn't make every use of it appropriate. There are professional conduct rules, privacy regulations, and jurisdiction-specific restrictions that govern what you can do with intelligence about witnesses, jurors, and opposing parties. The answer to "can AI profile this person from public data" is almost always yes. Whether you should in a given context is a different question, and one you need to answer before you run the workflow, not after.
The Hacker News profiling experiment Willison described was a demonstration, not a production tool. But it illustrated something that practitioners in law and insurance need to take seriously: the barrier to building a capable, detailed profile of a person from public data is now very low.
I deal with this at LexisNexis every day in a structured context — public records, court filings, licensed data sources. What's changed is that the same synthesis capability now applies to the messy, unstructured layer of the internet. The forum posts. The comment threads. The things people wrote without thinking anyone would ever read them together.
That's a capability your profession needs to understand, whether you intend to use it or not. Because the question of whether you should use it only matters if you know what it can do.
If you're thinking through how AI fits into your litigation prep workflow, your claims investigation process, or your client advisory on digital footprint risk, I'm happy to talk through what's practical. Drop me a line.