LLM-Powered Argument Mining in French Media
Abstract
This presentation walks through an end-to-end pipeline for mapping controversies in a 5,816-article French press corpus on advertising and media, built on Gemma-4 31B with thinking mode and a 256K-token context. Stage 1 casts the model as a Latourian sociologist and extracts 5–7 claims per article. Stage 2 embeds the 34,867 resulting claims with BGE-large and clusters them via UMAP → HDBSCAN, surfacing six dense controversies: greenwashing, privacy and tracking, generative AI, sexism in advertising, social-vs-traditional media, and the obsolescence of the traditional advertising model. Stage 3 re-reads only the articles of each cluster under a cluster-specific domain-expert persona, extracting companies with a role (e.g. accused / committed / defending / praised), a direct quote as evidence, and a 0–10 relevance score. The results recover the expected antagonists (TotalEnergies on greenwashing, Google and Meta on privacy) and surface sharper findings: Patagonia as the lone credible counter-example in greenwashing coverage, Aubade as a genuinely contested actor across the sexism debate, and a near-unanimous pro-AI stance that reveals the absence of live controversy rather than its presence. The contribution is methodological: a reusable three-stage design in which clustering supplies the topical lens that a single general prompt cannot.
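To make the Stage 3 output concrete, here is a minimal sketch of what one extracted record and a simple per-company role profile might look like. It is not the presenter's code: the class name `CompanyMention`, the helpers `role_profile` and `contested`, the relevance threshold of 5, and the rule used to flag a "contested" actor are all assumptions made for illustration. The company names and quotes are invented placeholders, not corpus data.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Role labels quoted in the abstract; the real label set may be larger.
ROLES = {"accused", "committed", "defending", "praised"}


@dataclass(frozen=True)
class CompanyMention:
    """One Stage 3 extraction (hypothetical schema)."""
    company: str
    role: str
    quote: str       # direct quote from the article, used as evidence
    relevance: int   # 0-10 relevance score

    def __post_init__(self):
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role!r}")
        if not 0 <= self.relevance <= 10:
            raise ValueError("relevance must be between 0 and 10")


def role_profile(mentions, min_relevance=5):
    """Count roles per company, skipping mentions below the (assumed) threshold."""
    profile = defaultdict(Counter)
    for m in mentions:
        if m.relevance >= min_relevance:
            profile[m.company][m.role] += 1
    return profile


def contested(profile):
    """Companies seen in both a negative and a positive role (illustrative heuristic)."""
    negative, positive = {"accused"}, {"praised", "committed"}
    return sorted(
        company for company, roles in profile.items()
        if negative & set(roles) and positive & set(roles)
    )


# Invented placeholder records, for illustration only.
mentions = [
    CompanyMention("BrandA", "accused", "placeholder quote 1", 8),
    CompanyMention("BrandA", "praised", "placeholder quote 2", 7),
    CompanyMention("BrandB", "accused", "placeholder quote 3", 9),
    CompanyMention("BrandB", "praised", "placeholder quote 4", 2),
]
profile = role_profile(mentions)
print(contested(profile))  # ['BrandA']
```

In this sketch, BrandB is not flagged because its only positive mention falls below the relevance threshold, which shows how the 0–10 score could be used to filter out incidental mentions before comparing roles.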
About this workshop
The aim of this workshop is to promote technical and practical exchanges between researchers who use NLP methods. Participants are encouraged to go into the details of their code (R/Python), share tips, and discover new methods and models.
Schedule: Thursdays from 12:15 to 13:30, by videoconference.