
    Ultra-processed food: an AI polling simulation

    Chris Bowdon, published 2025-06-22 23:00:00
    [This article was first published on Chris Bowdon, and kindly contributed to R-bloggers.]

    There’s a little drip drip drip of scare stories about ultra-processed food (UPF), a phrase that I’d never heard until this year. I’m starting to get worried. Is everyone else worried? Well, it turns out the Food Standards Agency (FSA) has been running a tracker survey containing this question since mid-2023, so we can find out.

    An idea that’s generating a lot of excitement recently is the suggestion that you can simulate public opinion polls with AI. The concept is very simple: you calibrate multiple LLM assistants to represent your different demographics and then sample their responses to questions. This is quicker and cheaper than polling real people, if you can make it reliable.
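    To make the idea concrete, here’s a minimal sketch of that loop in R using ellmer (the package used later in this post). The persona strings and model name are placeholders, not the calibration we’ll actually build.

    library(ellmer)

    # Two placeholder personas standing in for calibrated demographic profiles.
    personas <- c(
      "A 25-year-old renter in Manchester who skims the news on their phone",
      "A 68-year-old retiree in Kent who reads a print newspaper daily"
    )

    responses <- lapply(personas, function(p) {
      respondent <- chat_openai(
        model = "gpt-4.1-nano", # placeholder model
        system_prompt = paste(
          "You are a member of the British public.", p,
          "Answer survey questions in character."
        )
      )
      respondent$chat("How concerned are you about ultra-processed food? Answer in one sentence.")
    })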

    That sounds incredibly fun! In this post I’ll explore the FSA tracker survey data, and then build an AI simulation model.

    The survey data

    First we need to scrape the survey data. While it’s lovely that the FSA publishes all the monthly survey results (which were run by YouGov), it’s annoying that they don’t publish compiled statistics, so we need to get our tibbles dirty.

    As usual, if you like coding with R you can see the gory details under the folds.

    Code
    library(tidyverse)
    library(httr2)
    library(rvest)
    library(knitr)
    
    if (!str_ends(getwd(), "posts/food")) {
      setwd("posts/food")
    }
    
    download_files <- function() {
      url <- "https://data.food.gov.uk/catalog/datasets/0bfd916a-4e01-4cb8-ba16-763f0b36b50c"
    
      html_content <- request(url) |>
        req_perform() |>
        resp_body_string()
    
      # Parse the catalogue page and collect every linked .xlsx file
      doc <- read_html(html_content)
      xlsx_links <- doc |> html_nodes("a[href$='.xlsx']") |> html_attr("href")
      for (link in xlsx_links) {
        file_url <- link
        file_name <- basename(file_url)
        destfile <- sprintf("data/%s", file_name)
        if (!file.exists(destfile)) {
          download.file(url = file_url, destfile = destfile)
        }
      }
    }
    
    download_files()

    (Is there an easy way to read tracker survey spreadsheets into R? They have a common format, but I never found anything that did the job and had to roll my own. Anyway, eventually we end up with a single dataset.)

    Code
    library(tidyverse)
    library(readxl)
    library(knitr)
    
    extract_month_year <- function(file) {
      decoded_filename <- URLdecode(file)
    
      month_year_regex <- "^.*(January|February|March|April|May|June|July|August|September|October|November|December)\\s?(\\d{2}|\\d{4}).*.xlsx$"
    
      if (grepl(month_year_regex, decoded_filename)) {
        extracted_month_year <- regmatches(
          decoded_filename,
          regexec(month_year_regex, decoded_filename)
        )
    
        # Extract month and year
        month <- extracted_month_year[[1]][[2]]
        year <- extracted_month_year[[1]][[3]]
    
        if (str_length(year) == 2) {
          year <- paste0("20", year)
        }
    
        return(list(filename = file, month = month, year = year))
      } else {
        print(file)
        return(NULL)
      }
    }
    
    extract_question <- function(file, question_prefix) {
      fmy <- extract_month_year(file)
    
      data <- read_excel(
        file,
        sheet = "Percents",
        col_names = FALSE
      )
      headers <- as.vector(data[5, ], mode = "character")
      headers[1] <- "QCategory"
      headers <- zoo::na.locf(headers)
      subheaders <- as.vector(data[6, ], mode = "character")
      subheaders <- zoo::na.fill(subheaders, "")
      combined_headers <- map2(
        headers,
        subheaders,
        \(x, y) str_c(x, y, sep = ":: ")
      )
      combined_headers[1] <- "QCategory"
      combined_headers[2] <- "Total"
    
      colnames(data) <- combined_headers
    
      has_question <- any(str_starts(pull(data, QCategory), question_prefix))
    
      if (!is.na(has_question)) {
        data |>
          filter(!is.na(QCategory)) |>
          slice(
            which(str_starts(QCategory, question_prefix)):n()
          ) |>
          head(10) |> # take the cats
          tail(-1) |> # drop the header
          mutate(
            filename = fmy$filename,
            monthname = fmy$month,
            year = as.integer(fmy$year),
            month = match(
              monthname,
              c(
                "January",
                "February",
                "March",
                "April",
                "May",
                "June",
                "July",
                "August",
                "September",
                "October",
                "November",
                "December"
              )
            )
          )
      }
    }
    
    question_df <- function(question_prefix) {
      df_filename <- sprintf("data/%s.rds", question_prefix)
      if (file.exists(df_filename)) {
        df <- read_rds(df_filename)
      } else {
        xlsx_files <- list.files(
          path = "data",
          pattern = "*.xlsx",
          full.names = TRUE
        )
        df <- xlsx_files |>
          sort(decreasing = TRUE) |>
          lapply(\(f) extract_question(f, question_prefix)) |>
          keep(\(x) !is.null(x)) |>
          bind_rows() |>
          mutate(Question = question_prefix)
    
        df |> write_rds(df_filename)
      }
      df
    }
    
    df <- question_df("Q12_14")
    
    tidy_df <- df |>
      filter(QCategory != "Unweighted base" & QCategory != "Base: All") |>
      pivot_longer(
        c(Total, contains("::")),
        names_to = "Demographic",
        values_to = "Value"
      ) |>
      mutate(
        Period = as.Date(sprintf("%04d-%02d-01", year, month)),
        Response = QCategory,
        Value = as.numeric(Value)
      ) |>
      select(Question, Demographic, Period, Response, Value) |>
      arrange(Question, Period, Response, Demographic)
    
    tidy_df |> head() |> kable()
    Table 1: Example of the survey data.
    Question Demographic Period Response Value
    Q12_14 Age:: 16-24 2023-08-01 Don’t know 0.0384
    Q12_14 Age:: 25-34 2023-08-01 Don’t know 0.0518
    Q12_14 Age:: 35-44 2023-08-01 Don’t know 0.0156
    Q12_14 Age:: 45-54 2023-08-01 Don’t know 0.0142
    Q12_14 Age:: 55-74 2023-08-01 Don’t know 0.0161
    Q12_14 Age:: 75+ 2023-08-01 Don’t know 0.0204

    Finally we have ✨tidy✨ data and we can visualise it.

    Code
    library(patchwork)
    
    RESPONSE_CATS <- c(
      "Highly concerned",
      "Somewhat concerned",
      "Not very concerned",
      "Not concerned at all",
      "Don't know"
    )
    
    net_concern <- tidy_df |>
      filter(Demographic == "Total" & str_starts(Response, "Net: "))
    
    breakdown <- tidy_df |>
      filter(Demographic == "Total" & !str_starts(Response, "Net: ")) |>
      mutate(
        Response = if_else(Response %in% RESPONSE_CATS, Response, "Don't know"),
        Response = factor(Response, levels = RESPONSE_CATS, ordered = TRUE)
      ) |>
      # Re-normalise after removing the IDK cat
      group_by(Question, Demographic, Period) |>
      mutate(Value = Value / sum(Value)) |>
      ungroup()
    
    plot_net_concern <- ggplot(net_concern) +
      aes(x = Period, y = Value) +
      scale_y_continuous(limits = c(0, 1)) +
      geom_point() +
      stat_smooth(method = "lm") +
      labs(title = "Net concern", x = NULL, y = "Proportion of respondents")
    
    plot_breakdown <- ggplot(breakdown) +
      aes(x = Period, y = Value, group = Response, colour = Response) +
      geom_point() +
      stat_smooth(method = "lm") +
      labs(title = "Response breakdown", x = NULL, y = "Proportion of respondents")
    
    plot_net_concern + plot_breakdown
    Figure 1: Tracker for “At the moment, how concerned, if at all, do you personally feel about ultra-processed, or over-processing of food?” Left is net concern, right is all responses.

    The number of people with concerns is evidently slowly increasing. A linear regression fits net concern very well, so it’s reasonable to expect continued growth at the same rate (pending any intervention like policy changes). Presumably the line bends as we approach the inevitable core of stubborn sods who will never admit to being concerned about anything, but the data we have doesn’t support that kind of model.
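    (For the record, here’s a quick sketch of that linear fit, reusing the net_concern tibble built above; if the source data contains more than one “Net: ” category, filter to the concerned one first.)

    # Quick look at the linear trend behind Figure 1 (left panel).
    fit <- lm(Value ~ Period, data = net_concern)
    summary(fit)$r.squared
    # Period is a Date, so the slope is per day; scale it to change per year.
    coef(fit)[["Period"]] * 365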

    The breakdown by response type is interesting. We have to watch our step, given the noisiness, but it does seem to show a migration towards being highly concerned.

    Note that I have merged the two categories “I don’t know enough to comment” and “Don’t know” in the original data for two reasons: first, there’s a technical limit of 5 categories coming later, and second there is very little difference between these two.

    The media landscape

    Besides calibrating the simulated demographics, the major challenge is ensuring that the models are exposed to the same information as real respondents. If the survey concerns current events, that information is most likely not in the training data, so we need to make sure the AI has been exposed to the same media.

    Now, here’s an interesting observation: a surprisingly small number of publications account for a large proportion of UK readership. See this data from JournoFinder.

    The UK population is approximately 70 million people. Given that the Guardian and the Daily Mail target very different demographics (left and right wing respectively), we wouldn’t expect much overlap between their readerships, i.e. their combined monthly online readership alone probably covers half of the news-reading British population. Note also that BBC News isn’t included in that table, but it has a similar level of traffic, albeit with higher overlap.

    This suggests that if we’re interested in understanding how the media shapes public perception in the UK, we can do well by analysing the output of just a handful of sources. This is particularly true in the modern media landscape, in which:

    • Live television is a dying medium. (I was at a brilliant R meetup hosted by Datacove recently when the session discussion turned to media channels. No one present had watched live TV in the last 24 hours, and only around 1 in 10 in the last month.)
    • Radio has had its time, has had its power (though yet to have its finest hour 🎸).
    • Social media is increasingly popular, but news content on social media is dominated by reposts of articles from traditional journalists.

    So let’s create a corpus of news articles about ultra-processed food from the top 10 sources in the table above, plus the BBC. They have a combined readership of 64.8 million monthly readers in the UK (plus the BBC’s unknown-but-high readership); even with high overlap, this clearly represents the majority of the UK population. It’s enough to give us a solid understanding of what’s going on.

    I happen to work at Polecat who have licensed access to this data, so I can analyse it very easily. (Can’t share the raw data here though. 🙊)

    Since November 2023 there have been over 2.5 million articles from those sources, but only around 2000 mentioned ultra-processed food, and <500 seem to be focused on it, as opposed to mentioning it alongside other food or public health issues. Though all mentions would help with awareness, focus articles have a stronger effect on informing opinion.

    Code
    source("articles.R")
    
    upf_articles <- load_upf_articles()
    domain_counts <- load_domain_counts()
    
    ggplot(
      upf_articles |>
        mutate(
          mention_type = if_else(is_focus, "Focus article", "Mentions UPFs")
        )
    ) +
      aes(x = period, group = is_focus) +
      geom_histogram(stat = "count") +
      facet_wrap(~mention_type) +
      labs(
        title = "Articles mentioning ultra-processed food from top UK sources, by month.",
        x = NULL,
        y = "Article count"
      )
    Figure 2: Counts of articles mentioning ultra-processed food from the top UK online news sources, by month.

    N.B. The Sun and the Times have further licence restrictions, so aren’t included here.

    Although the number of articles focused on UPFs is not showing any trend, the number of articles mentioning UPFs is increasing over time. The fluctuation is most likely seasonal, and the trend is apparently upwards. There’s just about enough data to fit a SARIMA model. (We ought to have two years; we only have eighteen months, but the cycles seem to be sub-annual.)

    Code
    mention_counts <- ts(
      (group_by(upf_articles, period) |> count())$n,
      start = c(2024, 1),
      deltat = 1 / 12
    )
    afit <- forecast::auto.arima(mention_counts, d = 1, D = 1)
    plot(forecast::forecast(afit, h = 5, level = 90))
    Figure 3: A (very uncertain) SARIMA fit of the mention counts.

    It’s tempting to look at the upward trend in mentions and the upward trend in net concern, and call it a day. Media mentions go up, net concern goes up, quod erat demonstrandum! But this is a bit flimsy.
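    (To see how flimsy, here’s a rough sketch of the naive correlation between monthly mention counts and net concern, reusing the net_concern and upf_articles objects from above; even a high number here wouldn’t establish anything causal, and the overlap between the two series is short.)

    # Rough sketch: join monthly mention counts to net concern and correlate.
    # Assumes net_concern has one "Net: " row per month; filter first if not.
    monthly_mentions <- upf_articles |>
      count(period, name = "n_mentions")

    naive_join <- net_concern |>
      inner_join(monthly_mentions, by = c("Period" = "period"))

    cor(naive_join$Value, naive_join$n_mentions)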

    Why can’t we just count mentions and be done with it?

    1. Even considering that we’ve already reasoned about UK audiences and scoped our search to selected top publications, this is still a poor model of readership. Every mention in every article counts equally, even though we can see that the Guardian has over ten times the number of readers the Financial Times does.
    2. This model assumes that every mention of UPFs contributes an equal amount of concern. Some of them might be positive, or cause different levels of concern.
    3. Net concern has increased by <5%, whereas the mentions have doubled. We have reason to be skeptical that the mention count is the whole story.

    We need to take a more nuanced approach.

    A better readership model

    Let’s start with a look at how the articles are distributed amongst the sources.

    Code
    ggplot(
      upf_articles |> group_by(domain, period) |> count()
    ) +
      aes(x = period, y = n, group = domain, colour = domain) +
      facet_wrap(~domain, nrow = 2) +
      geom_point(show.legend = F) +
      geom_line(linewidth = 1, show.legend = F)
    Figure 4: Articles focusing on ultra-processed food by source, since Jan 2024.

    Perhaps unsurprisingly, the Daily Mail has by far the most. Who’d have thought? They are also showing the most obvious increase in attention to the topic. (Incidentally, the Daily Mail also has the most articles about cancer, bone disease, and early death. They must employ a lot of medical professionals.) To be completely fair to the Daily Mail though, we should note that UPF coverage is a smaller proportion of their total output, which is enormous.

    Code
    ggplot(
      group_by(domain_counts, domain) |> summarise(n = sum(n_articles))
    ) +
      aes(
        y = fct_reorder(domain, n),
        x = n,
        group = domain,
        colour = domain,
        fill = domain
      ) +
      geom_col(linewidth = 1, show.legend = F)
    Figure 5: Total number of articles output by top sources, since Jan 2024.

    Modelling

    We now have enough information to build a Monte Carlo model of media exposure. We’ll keep it as simple as we can without compromising too much on representativeness.

    Everything should be as simple as possible, but no simpler. (Einstein?)

    Let’s start by assuming the UK population is exposed to articles from each source in proportion to that source’s estimated readership. Things are going to get a bit mathsy now.

    Let every reader sample a number of articles from each source according to a Poisson distribution, i.e. a natural discrete distribution of counts: each reader draws N_s ~ Poisson(λ_s) articles from source s. Each λ_s is parameterised according to relative readership levels, so for example λ_mail > λ_metro on average, but with lots of randomisation, so some of our simulated readers will draw more from the Mail, some more from the Metro, etc.

    The reader then draws articles at random from each source. (This is a simplification of course.) To represent the fact that only a small proportion of articles from each source in each period are about UPFs, we draw the observations from a binomial distribution parameterised by p_s, the relative frequency of UPF articles for the source: K_s | N_s ~ Binomial(N_s, p_s), where K_s denotes the count of UPF-related articles we get back.

    Since p_s is very small, this will be very inefficient and we will mostly draw zeros. But maths comes to the rescue, because the unconditional distribution of K_s can be given by:

    K_s ~ Poisson(λ_s × p_s)

    This is much faster to compute. We need to do so for each of our simulated readers to get a count of relevant articles they will read from each source. We then materialise these articles by sampling from the actual UPF articles from the source in that time period, generate the LLM assessment for the articles, and take the mean assessment.
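    (If you want to convince yourself of that identity, a quick simulation does the job. This is a self-contained sanity check with made-up λ and p values, not the actual model parameters.)

    # Sanity check of the Poisson-binomial thinning identity with toy numbers.
    set.seed(1)
    lambda <- 6.3   # made-up expected articles read from one source per month
    p_upf <- 0.001  # made-up probability that any one article is about UPFs

    # Direct but inefficient route: draw N articles, then thin with a binomial.
    n_read <- rpois(1e6, lambda)
    k_slow <- rbinom(1e6, size = n_read, prob = p_upf)

    # Collapsed route: a single Poisson with rate lambda * p.
    k_fast <- rpois(1e6, lambda * p_upf)

    c(mean(k_slow), mean(k_fast)) # both approximately lambda * p_upf = 0.0063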

    This is quite a complicated model, but it’s also leaving A LOT out. We’re ignoring the demographics of each source’s readership, for example, and we’re making the incorrect assumption that all articles are equally likely to be read. It would be better to have a multinomial distribution across the articles rather than our simple binomial. Ideally that would be informed by accurate article-level readership statistics, though those are difficult if not impossible to get for all sources. We’re also ignoring incomplete reads or misreads.

    Code
    readership <- tibble(
      domain = factor(DOMAINS, levels = DOMAINS, ordered = TRUE),
      monthly_readers = 1000000 *
        c(59, 19.7, 12.3, 8.3, 5.9, 5.7, 4.6, 2.6, 2.5, 1.3) # BBC estimated from OFCOM survey
    ) |>
      merge(
        domain_counts |>
          group_by(domain, period) |>
          summarise(n_total_articles = sum(n_articles)),
        all.x = TRUE
      ) |>
      merge(
        upf_articles |> group_by(domain, period) |> summarise(n_upf_articles = n()),
        all.x = TRUE
      ) |>
      mutate(
        n_upf_articles = coalesce(n_upf_articles, 0),
        prob_upf = n_upf_articles / n_total_articles,
        # Number of articles in the time period, calibrated such
        # that the expected number of BBC articles is approx 1/day.
        # This is an educated guess, not much research to back it I'm afraid. Pew found in 2015 that
        # "An overwhelming majority of both long-form readers (72%) and short-form readers (79%)
        # view just one article on a given site over the course of a month on their cellphone."
        # - https://www.pewresearch.org/journalism/2016/05/05/long-form-reading-shows-signs-of-life-in-our-mobile-news-world/
        # That was specific to the heady earlier days of the mobile web though.
        # I found various industry-backed studies suggesting it could be higher, such as this NewsWorks study that
        # found young people reading 6/day, however this is really suspicious and they have an
        # obvious incentive to inflate the numbers.
        # https://pressgazette.co.uk/media-audience-and-business-data/media_metrics/young-people-news/
        lambda = 30.4 * monthly_readers / max(monthly_readers),
      )

    I’ve had to make an educated guess on the expected number of articles read per month - details are in the code above.

    We can now build the model using R’s statistical functions and simulate a number of readers. Then we can examine how many articles on UPFs the readers are likely to read each month.

    Code
    set.seed(42)
    
    sample_article_counts <- function(month_readership, n_sim_readers) {
      result <- matrix(
        ncol = length(DOMAINS),
        nrow = n_sim_readers,
        dimnames = list(1:n_sim_readers, DOMAINS)
      )
      for (i in 1:nrow(month_readership)) {
        source <- as.list(month_readership[i, ])
        n <- rpois(n_sim_readers, source$lambda * source$prob_upf)
        result[, i] <- n
      }
      result
    }
    
    PERIODS <- seq.Date(
      from = as.Date("2024-01-01"),
      to = as.Date("2025-05-01"),
      by = "1 month"
    )
    
    N_SIM_READERS = 10000
    
    sim_counts <- PERIODS |>
      map(
        function(p) {
          sample_article_counts(
            month_readership = filter(readership, period == p),
            n_sim_readers = N_SIM_READERS
          )
        }
      )
    Code
    monthly_sim_count_totals <- sim_counts |> map(~ table(rowSums(.)))
    
    monthly_sim_count_totals |>
      map(as.data.frame) |>
      bind_rows() |>
      rename(n_upf_articles_read_in_month = Var1) |>
      group_by(n_upf_articles_read_in_month) |>
      summarise(percent_of_readers = 100 * mean(Freq) / N_SIM_READERS) |>
      kable(digits = 1)
    n_upf_articles_read_in_month percent_of_readers
    0 96.3
    1 3.6
    2 0.1
    3 0.0

    This suggests that around 4% of news readers will read an article on UPFs every month. That seems quite reasonable. The number has grown from around 2% to around 6% over the last year and a half. This is more realistic than our original model, which simply noted a doubling in the number of articles. Our more sophisticated model says the number of articles read has probably tripled, but that they only influence a very small proportion of readers.

    A little reminder: we have modelled just the top 10 UK news sources, which have orders of magnitude more eyeballs than all other online news sources. We would expect the contribution from local news, blogs, trade pubs, etc. to be negligible.

    Code
    upf_readership_over_time <- map2(
      PERIODS,
      monthly_sim_count_totals,
      function(p, t) {
        tibble(
          period = p,
          pc_sim_readers_reading_at_least_one_article = 100 *
            (N_SIM_READERS - t[[1]]) /
            N_SIM_READERS
        )
      }
    ) |>
      bind_rows()
    
    ggplot(upf_readership_over_time) +
      aes(x = period, y = pc_sim_readers_reading_at_least_one_article) +
      geom_line() +
      scale_y_continuous(
        limits = c(0, 20),
        breaks = seq(0, 20, 10),
        minor_breaks = seq(0, 20, 5)
      ) +
      labs(
        x = NULL,
        y = "Percent of simulated readers",
      )
    Figure 6: Change in percent of simulated readers who read at least one UPF article in a month over time.

    By examining the model parameters, we can see that the sources that contribute most are the Guardian, the Telegraph, and the BBC.

    Code
    readership |>
      group_by(domain) |>
      summarise(
        lambda = mean(lambda),
        prob_upf = mean(prob_upf),
        monthly_readers = mean(monthly_readers),
        monthly_total_articles = mean(n_total_articles),
        monthly_upf_articles = mean(n_upf_articles),
        exp_monthly_upf_articles = mean(lambda * prob_upf)
      ) |>
      mutate(
        exp_monthly_upf_articles = exp_monthly_upf_articles /
          sum(exp_monthly_upf_articles)
      ) |>
      arrange(desc(exp_monthly_upf_articles)) |>
      kable(
        digits = c(0, 1, 4, 0, 0, 0, 2),
        format.args = list(decimal.mark = ".", big.mark = ",")
      )
    Table 2: Mean monthly model parameters for each domain.
    domain lambda prob_upf monthly_readers monthly_total_articles monthly_upf_articles exp_monthly_upf_articles
    theguardian.com 10.2 0.0012 19,700,000 7,665 10 0.33
    telegraph.co.uk 2.4 0.0033 4,600,000 5,167 17 0.20
    bbc.co.uk 30.4 0.0001 59,000,000 17,938 2 0.11
    dailymail.co.uk 6.3 0.0006 12,300,000 52,210 33 0.11
    independent.co.uk 4.3 0.0006 8,300,000 14,587 9 0.07
    express.co.uk 3.0 0.0008 5,900,000 12,922 11 0.07
    mirror.co.uk 2.9 0.0008 5,700,000 13,484 10 0.06
    standard.co.uk 1.3 0.0006 2,600,000 6,563 4 0.02
    ft.com 0.7 0.0009 1,300,000 12,870 11 0.02
    metro.co.uk 1.3 0.0004 2,500,000 6,682 3 0.01

    Again this differs significantly from our simpler model, which would have pinned it all on the Daily Mail, as that source has the largest absolute number of UPF articles. The Telegraph is initially surprising given their comparatively low readership, but they have a very high probability of producing UPF articles.

    With a more realistic model of readership in hand, we can move on to the AI polling.

    Building the AI’s world view

    The next step is to sample specific articles for each reader given their counts. In our model, these are the articles that will influence opinions, i.e. those which inform the LLM survey respondents.

    Code
    sample_index <- expand.grid(
      period = 1:length(PERIODS),
      reader = 1:N_SIM_READERS,
      domain = 1:length(DOMAINS)
    )
    
    sampled_articles <- 1:nrow(sample_index) |>
      map(
        function(i) {
          p <- sample_index[i, 1]
          r <- sample_index[i, 2]
          s <- sample_index[i, 3]
    
          n <- sim_counts[[p]][r, s]
    
          if (n > 0) {
            upf_articles |>
              filter(period == PERIODS[p] & domain == DOMAINS[s]) |>
              slice_sample(n = n, replace = FALSE) |>
              mutate(period = PERIODS[p], reader = r, n = n)
          }
        },
        .progress = F
      ) |>
      bind_rows()
    
    print(
      sprintf(
        "Percent of readers who have read at least one UPF article in the total period: %.1f%%",
        100 * n_distinct(sampled_articles$reader) / N_SIM_READERS
      )
    )
    [1] "Percent of readers who have read at least one UPF article in the total period: 46.8%"

    Aggregating this sample tells us that nearly half of our (simulated) reading population have read at least one article about UPFs in the last year and a half.

    Finally we’re ready to see what effect these articles have on the readers. Break out the AI!

    Asking the AI questions

    Now we get to move on from that awful CPU-based statistical model (boo, dull) to an awesome GPU-based statistical model (wow, sexy).

    The approach we take is to force the LLM into picking a category via structured outputs, and then review the probabilities with which it would have picked each category. R aficionados, note that we are using ellmer, which is more or less the de facto AI package for R.

    To avoid sharing the licensed news data with a third-party, the model is a local Qwen3 model running with a patched version of mlx-omni-server. Qwen3’s reasoning has been disabled because that changes the conditioning of the model and we just want its immediate “system 1” response to the survey questions. Working with mlx-omni-server is also awesome because when you’re confused about some of the finer points of logits and sampling you can just LOOK at the code and see what’s happening.

    Most of the AI code is folded, but it’s useful to see the prompts.

    # This is the system prompt, which initiates every conversation by giving the AI a persona and instructions.
    SYSTEM_PROMPT <- "You are a member of the British public, with your own life experiences and opinions.
    
    This is your identity: 
    <identity>
    {{simulated_identity}}
    </identity>
    
    You are being asked for your views as part of a survey. Respond to each question from the pollster."
    
    # This is the question precisely as asked in the YouGov survey.
    UPF_QUESTION <- "At the moment, how concerned, if at all, do you personally feel about ultra-processed, or over-processing of food?"
    
    # This is how we wrap the question with relevant articles (and a date, for context).
    QUESTION_ARTICLE_WRAPPER <- "Answer the question, recalling that you have seen the following articles that may or may not have influenced your opinion.
    
    <articles>
    {{headlines}}
    </articles>
    
    (Today's date is {{date_today}}.)
    
    QUESTION: {{question}}"
    Code
    library(ellmer)
    
    create_respondent <- function(
      simulated_identity = "Average person"
    ) {
      chat_openai(
        base_url = "http://0.0.0.0:10240/v1/",
        # Gemma is good
        #model = "mlx-community/gemma-3-1b-it-4bit-DWQ",
        #model = "mlx-community/gemma-3-27b-it-4bit-DWQ",
        #model = "mlx-community/gemma-3-27b-it-qat-8bit",
        # Qwen is great
        #model = "mlx-community/Qwen3-0.6B-4bit-DWQ-053125",
        #model = "mlx-community/Qwen3-1.7B-4bit-DWQ-053125",
        model = "mlx-community/Qwen3-4B-4bit-DWQ-053125",
        #model = "mlx-community/Qwen3-14B-4bit-DWQ-053125",
        # Uncensored?
        #model = "mlx-community/Josiefied-Qwen3-0.6B-abliterated-v1-4bit",
        # GPT for debugging, do not use with data
        #model = "gpt-4.1-nano",
        api_key = "n/a",
        system_prompt = interpolate(SYSTEM_PROMPT),
        params = params(
          # Single output (all the local server supports)
          n = 1,
          # Get max available log probs
          log_probs = TRUE,
          top_logprobs = 5,
          # Keep the full distribution for sampling
          top_p = 1,
          # Do not adjust the distribution, so no cascading randomness
          temperature = 0
        ),
        api_args = list(
          # Ensure we disable reasoning from Qwen
          enable_thinking = FALSE
        )
      )
    }
    
    extract_answer_dist <- function(respondent, choices) {
      rc <- tibble(
        token = LETTERS[1:length(choices)],
        Response = factor(choices, levels = choices, ordered = TRUE)
      )
    
      token <- respondent$last_turn()@json$choices[[1]]$logprobs$content |>
        keep(~ str_detect(.x$token, paste(rc$token, collapse = "|"))) |>
        first()
    
      if (is.null(token)) {
        stop(
          respondent$last_turn()@json$choices[[1]]
        )
      }
    
      token$top_logprobs |>
        bind_rows() |>
        select(token, logprob) |>
        unique() |>
        merge(rc) |>
        mutate(Probability = exp(logprob)) |>
        select(Response, Probability, logprob) |>
        arrange(Response)
    }
    
    question_with_articles <- function(question, articles, date_today = today()) {
      headlines <- articles |>
        map(function(a) {
          with(
            a,
            {
              excerpts <- paste(
                map(highlights, ~ sprintf("> %s", .)),
                collapse = "\n"
              )
              sprintf(
                "HEADLINE: %s (%s, %s)\nEXCERPTS:\n%s\n\n",
                title,
                domain,
                publish_date,
                excerpts
              )
            }
          )
        })
    
      interpolate(QUESTION_ARTICLE_WRAPPER)
    }
    
    
    multiple_choice_question <- function(question, choices) {
      paste(
        c(
          question,
          "",
          imap(choices, ~ sprintf("%s) %s", LETTERS[.y], .x)),
          "",
          "Respond with the letter of your chosen answer."
        ),
        collapse = "\n"
      )
    }
    
    ask <- function(respondent, question, choices) {
      mcq <- multiple_choice_question(question, choices)
      tryCatch(
        {
          respondent$chat_structured(
            mcq,
            type = type_enum(
              description = "Answer letter",
              values = LETTERS[1:length(choices)]
            )
          )
          extract_answer_dist(respondent, choices)
        },
        error = function(e) {
          print(respondent$last_turn()@json)
          stop(e)
        }
      )
    }

    We can now get the probability that the LLM would have selected each response in this conversation, given its identity.

    Code
    default_response <- ask(
      create_respondent("Average person"),
      UPF_QUESTION,
      RESPONSE_CATS
    )
    default_response |> kable(digits = 3)
    Table 3: The response distribution for the UPF survey question with a default “Average person” identity.
    Response Probability logprob
    Highly concerned 0.076 -2.574
    Somewhat concerned 0.439 -0.824
    Not very concerned 0.342 -1.074
    Not concerned at all 0.076 -2.574
    Don’t know 0.067 -2.699

    Note that I’ve removed the response category “I don’t know enough to comment” for technical reasons and because of its similarity to “Don’t know”.

    Although we have fixed the temperature to zero to avoid cascading randomness from token sampling, there is still some non-determinism at the hardware level. It almost never affects the most likely token or the logits.

    We can now present the AI with an example article, to understand the article’s influence on its position.

    fake_articles <- list(
      list(
        title = "Ultra-processed food linked to potential bowel cancer risk, scientists say",
        domain = "dailymail.co.uk",
        publish_date = as.Date("2025-05-01"),
        highlights = c(
          "Ultra-processed foods increase the cancer risk by 3%.",
          "Maybe we should be concerned about ultra-processed foods."
        )
      )
    )
    
    ask(
      create_respondent("Average person"),
      question_with_articles(UPF_QUESTION, fake_articles),
      RESPONSE_CATS
    ) |>
      kable(digits = 3)
    Table 4: Example of simulated AI polling response to a fictional alarming headline.
    Response Probability logprob
    Highly concerned 0.022 -3.819
    Somewhat concerned 0.727 -0.319
    Not very concerned 0.162 -1.819
    Not concerned at all 0.036 -3.319
    Don’t know 0.053 -2.944

    We can also condition the model on a different identity to get a different response distribution.

    ask(
      create_respondent("Someone very concerned about their health"),
      question_with_articles(UPF_QUESTION, fake_articles),
      RESPONSE_CATS
    ) |>
      kable(digits = 3)
    Table 5: Response distribution of LLM again, this time with the identity of someone very concerned about their health.
    Response Probability logprob
    Highly concerned 0.418 -0.871
    Somewhat concerned 0.537 -0.621
    Not very concerned 0.016 -4.121
    Not concerned at all 0.010 -4.621
    Don’t know 0.018 -3.996

    A basic model, but already quite interesting. Let’s test it for bias.

    Code
    gender_resp <- c("Female", "Male") |>
      map(
        ~ ask(
          create_respondent(.),
          UPF_QUESTION,
          RESPONSE_CATS
        ) |>
          mutate(simulated_identity = .)
      ) |>
      bind_rows()
    
    # gender_resp |> mutate(Net = Response %in% c("Highly concerned", "Somewhat concerned")) |> group_by(simulated_identity, Net) |> summarise(p = sum(Probability))
    
    gender_resp |>
      pivot_wider(
        id_cols = Response,
        names_from = simulated_identity,
        values_from = Probability
      ) |>
      kable(digits = 3)
    Table 6: Response distribution of LLM for different genders.
    Response Female Male
    Highly concerned 0.082 0.109
    Somewhat concerned 0.603 0.552
    Not very concerned 0.196 0.203
    Not concerned at all 0.064 0.085
    Don’t know 0.056 0.051

    There is a little bias there. Normally we would seek to avoid this, but that’s the point of AI-simulated polling: to exploit the biases to model “typical” responses. If we look at the actual responses by gender, we can see that there really is a gender difference, i.e. female respondents were more likely to be highly concerned than male respondents.

    In my testing, larger models were more likely to demonstrate bias in the same direction as real respondents (i.e. females having more net concern). Not all model families were suitable, for example Gemma 3 was more than 90% weighted towards “Highly concerned”.

    Code
    gender_breakdown <- tidy_df |>
      filter(str_starts(Demographic, "Gender") & !str_starts(Response, "Net: ")) |>
      mutate(
        Response = if_else(Response %in% RESPONSE_CATS, Response, "Don't know"),
        Response = factor(Response, levels = RESPONSE_CATS, ordered = TRUE),
        Demographic = str_replace(Demographic, "Gender:: ", "")
      ) |>
      # Re-normalise after removing the IDK cat
      group_by(Question, Demographic, Period) |>
      mutate(Value = Value / sum(Value)) |>
      ungroup()
    
    plot_gender_dist <- ggplot(
      gender_breakdown |>
        group_by(Demographic, Response) |>
        summarise(Value = mean(Value)) |>
        mutate(Value = if_else(Demographic == "Male", Value, -Value))
    ) +
      aes(x = Value, y = fct_rev(Response), fill = Demographic) +
      scale_fill_manual(values = c("Male" = "orange", "Female" = "purple")) +
      geom_col() +
      labs(
        title = "Gender response distribution",
        x = "Mean proportion of respondents",
        y = NULL
      )
    
    plot_gender_time <- ggplot(gender_breakdown) +
      aes(x = Period, y = Value, group = Response, colour = Response) +
      facet_wrap(~Demographic) +
      geom_point() +
      stat_smooth(method = "lm") +
      labs(
        title = "Response breakdown over time",
        x = NULL,
        y = "Proportion of respondents"
      )
    
    plot_gender_dist + plot_gender_time + plot_layout(nrow = 2, ncol = 1)
    Figure 7: Actual response distribution for different genders. Females are more likely to say they are “highly concerned” about ultra-processed foods.

    To make a useful gender comparison, we’d need to calibrate the LLM such that its predictions are consistent with known gender biases. The original paper from Argyle et al. did this by conditioning the LLM on backstories - where we just have “male” or “female”, they have a more comprehensive paragraph - and sampling from a set of backstories designed to be representative of the population.
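    Since create_respondent() already takes a free-text identity, backstory conditioning slots straight into the existing setup. Here’s a rough sketch of what that would look like; the backstory below is invented for illustration, and a realistic version would sample from a set of backstories designed to match population statistics.

    # Sketch of backstory conditioning in the style of Argyle et al.
    # The backstory text is invented for illustration only.
    backstory <- paste(
      "You are a 38-year-old woman living in Bristol.",
      "You work as a primary school teacher, have two young children,",
      "cook most meals from scratch, and read the news most days."
    )

    ask(
      create_respondent(backstory),
      UPF_QUESTION,
      RESPONSE_CATS
    ) |>
      kable(digits = 3)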

    Running AI over the dataset

    Let’s apply this to our articles now. We can use furrr to run every simulated survey respondent in parallel. We’ll start with a single “average person” identity.

    Code
    library(furrr)
    
    samples_per_reader <- sampled_articles |>
      group_by(period, reader) |>
      summarize(
        sample_key = str_flatten_comma(sort(article_id)),
        article_ids = list(sort(article_id)),
        n_article = n_distinct(article_id),
      )
    
    unique_samples <- samples_per_reader |>
      distinct(period, sample_key, article_ids)
    
    sim_ident <- "Average person"
    
    simulate_reader <- function(row) {
      tryCatch(
        {
          arts <- upf_articles |> filter(article_id %in% row$article_ids)
          dist <- ask(
            create_respondent(sim_ident),
            question_with_articles(
              UPF_QUESTION,
              pmap(arts, list),
              date_today = row$period
            ),
            RESPONSE_CATS
          )
          dist |>
            mutate(sample_key = row$sample_key, simulated_identity = sim_ident)
        },
        error = \(e) print(e, row)
      )
    }
    
    map_simulated_readers <- function(unique_samples, sim_ident, force = FALSE) {
      datafile <- sprintf("data/article-responses - %s.rds", sim_ident)
    
      if (!file.exists(datafile) || force) {
        with(plan(multisession, workers = 4), {
          tasks <- future_map(
            pmap(unique_samples, list),
            simulate_reader,
            .progress = TRUE
          )
        })
    
        article_responses <- bind_rows(tasks) |> as_tibble()
        article_responses |> write_rds(datafile)
      }
      read_rds(datafile)
    }
    
    article_responses <- map_simulated_readers(
      unique_samples,
      sim_ident,
      #force = TRUE
    )
    
    sim_results <- samples_per_reader |> merge(article_responses)

    We have now asked every simulated reader the poll question every month. We have a probability distribution of possible answers for each, so we can up-sample. We can also assume that the readers we didn’t simulate saw nothing that would change their minds. This ignores word-of-mouth and other forms of media, so a better model would need to consider those.

    What we need to do now is combine the default probability distribution with the article-readers’ distributions. We’ll do that month by month: we could sample from the default distribution for the non-readers and from the individual distributions for the readers, but that amounts to much the same thing as a weighted sum, which is the simpler option, so that’s what we do below.

    Code
    period_sim_dists <- matrix(nrow = length(PERIODS), ncol = length(RESPONSE_CATS))
    
    for (i in seq_along(PERIODS)) {
      period_sim_results <- sim_results |> filter(period == PERIODS[i])
      n_non_readers <- N_SIM_READERS - n_distinct(period_sim_results$reader)
    
      non_reader_dist <- n_non_readers * default_response$Probability
      reader_dist <- (period_sim_results |>
        group_by(Response) |>
        summarise(p = sum(Probability)))$p
      total_dist <- non_reader_dist + reader_dist
      period_sim_dists[i, ] <- total_dist / sum(total_dist)
    }
    
    rownames(period_sim_dists) <- as.character(PERIODS)
    colnames(period_sim_dists) <- RESPONSE_CATS
    
    sim_breakdown <- as_tibble(period_sim_dists) |>
      mutate(Period = PERIODS) |>
      pivot_longer(-c(Period), names_to = "Response", values_to = "Value") |>
      mutate(Response = factor(Response, levels = RESPONSE_CATS, ordered = TRUE))
    
    sim_net_concern <- sim_breakdown |>
      mutate(
        Concerned = Response %in% c("Highly concerned", "Somewhat concerned")
      ) |>
      group_by(Period, Concerned) |>
      summarise(Value = sum(Value)) |>
      filter(Concerned)
    
    plot_actual_breakdown <- ggplot(breakdown) +
      aes(x = Period, y = Value, group = Response, colour = Response) +
      geom_point(show.legend = FALSE) +
      stat_smooth(method = "lm", show.legend = FALSE) +
      labs(
        title = "Actual response breakdown",
        x = NULL,
        y = "Proportion of respondents"
      )
    
    plot_sim_breakdown <- ggplot(sim_breakdown) +
      aes(x = Period, y = Value, group = Response, colour = Response) +
      geom_point() +
      stat_smooth(method = "lm") +
      labs(
        title = "Simulated response breakdown",
        x = NULL,
        y = "Proportion of simulated respondents"
      )
    
    plot_actual_breakdown + plot_sim_breakdown
    Figure 8: Comparison of actual and simulated responses over time.

    This is encouraging! Remember the AI has not been calibrated on the actual survey data, only on our media model, and hasn’t had any demographic conditioning at this point. Despite this, it predicts trends similar to reality: a small percentage-point increase in high concern, with similar drops in “somewhat” and “not very”. The starting points for each trend are mostly wrong though, which points to the default response needing refinement.

    Digging deeper

    Let’s dig into our simulated responses a bit.

    Code
    ggplot(
      article_responses |>
        group_by(Response) |>
        summarise(Probability = mean(Probability))
    ) +
      aes(y = forcats::fct_rev(Response), x = Probability, fill = Response) +
      geom_col(show.legend = F) +
      scale_x_continuous(limits = c(0, 1)) +
      labs(x = "Mean probability across all articles", y = "Response")
    Figure 9: Mean response probability across all simulated article readers.

    Again, we haven’t calibrated the LLM on real human responses, but among the headlines there are plenty about cancer, bone disease, and early death. “Highly concerned” is not unrealistic.

    This suggests that we could be underweighting the influence of media in our model. Considering just the contribution of simulated respondents who were exposed to media shows how these respondents have influenced the overall trend.

    Code
    ggplot(
      sim_results |>
        rename(Period = period) |>
        group_by(Period, Response) |>
        summarise(Value = sum(Probability) / N_SIM_READERS)
    ) +
      aes(x = Period, y = Value, group = Response, colour = Response) +
      geom_point() +
      stat_smooth() +
      labs(x = NULL, y = "Proportion of all respondents")
    Figure 10: Contribution of simulated readers to response trends.

    It’s interesting that the articles appear to have polarised the simulated readers, with both “Highly concerned” and “Not concerned at all” growing. The rise in “Don’t know” might be confusion from contradictory articles, or might be due to articles that only mentioned ultra-processed foods in passing.

    Where next?

    We could make some big improvements:

    • Better results would come from refining our media model. It seems likely that we’re underestimating the media influence.
    • So far we’ve only simulated thousands of “Average person” identities. Some basic demographic splits - the “silicon sampling” approach - would be expected to improve realism. That said, UPFs are not a highly polarised issue so the improvement would most likely be modest here.

    However, this little project could easily turn into a months-long labour. Time to stop, go outside, touch grass.

    Wrapping up

    We’ve covered a lot of ground here:

    1. Collated and visualised the FSA’s tracker survey for concern about UPFs.
    2. Built a probabilistic media readership model that reflects real UK news consumption.
    3. Built an LLM-based survey respondent simulator.
    4. Combined the Monte Carlo media model and the LLM respondents to simulate how British people exposed to news media over the last 18 months would respond to the UPF question, i.e. simulated the FSA’s tracker survey with AI.

    We were following the ideas and approach outlined by Argyle et al., but with a much more niche survey question (British public opinion on UPFs vs US voting preference) and with a media model to capture the information that would truly inform opinion.

    The simulated survey showed some promising correlation with the actual survey. It really is exciting when you consider the relative effort compared to polling a thousand people every month. I can’t wait to see how this develops.


