    Extracting all links from my slidedeck

Maëlle's R blog on Maëlle Salmon's personal website, published 2024-07-16

Last week after my useR! talk, someone I had met at the R-Ladies dinner asked me for a list of all the links in my slides. I said I’d prepare it, not because I’m a nice person, but because I knew it’d be a use case where the great tinkr package would shine! 😈

    What is tinkr?

tinkr is an R package I created, and which its current maintainer Zhian Kamvar has taken much further than I ever would have. tinkr can transform Markdown into XML and back.

    Under the hood, tinkr uses

• commonmark for the Markdown-to-XML conversion. CommonMark, through its cmark implementation, is the C library that GitHub, for instance, uses to display your Markdown comments as HTML. The commonmark package is also what powers Markdown support in roxygen2. (See the small sketch after this list.)
• xslt for the XML-to-Markdown conversion. XSLT is a templating language for XML.
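
To get a feel for the first of these conversions, here is a minimal sketch using commonmark directly (the input string is made up, and this is not part of the actual workflow below):

commonmark::markdown_xml("A [link](https://example.org) in *Markdown*.")
# returns a single character string of CommonMark XML,
# in which the link becomes a <link destination="https://example.org"> node.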

    Anyway, enough said, let’s go back to today’s use case.

    Extract and format links from index.qmd

    With tinkr I can use XPath, the query language for XML or HTML, to extract links from my slidedeck source. Then I will format them as a list.
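
As a quick illustration of what such an XPath query does, here is a made-up XML snippet (not from my slides) queried with xml2. Note that in the slidedeck’s XML the nodes live in the CommonMark namespace, which is why the real query further down uses the md: prefix and an ns argument:

toy <- xml2::read_xml(
  '<doc><link destination="https://example.org">a link</link><text>no link here</text></doc>'
)
xml2::xml_find_all(toy, ".//link")
#> {xml_nodeset (1)}
#> [1] <link destination="https://example.org">a link</link>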

    First, I create a yarn object from my slidedeck source.

    talk_yarn <- tinkr::yarn$new("/home/maelle/Documents/conferences/user2024/index.qmd")
    talk_yarn
    #> <yarn>
    #>   Public:
    #>     add_md: function (md, where = 0L) 
    #>     body: xml_document, xml_node
    #>     clone: function (deep = FALSE) 
    #>     get_protected: function (type = NULL) 
    #>     head: function (n = 6L, stylesheet_path = stylesheet()) 
    #>     initialize: function (path = NULL, encoding = "UTF-8", sourcepos = FALSE, 
    #>     md_vec: function (xpath = NULL, stylesheet_path = stylesheet()) 
    #>     ns: http://commonmark.org/xml/1.0
    #>     path: /home/maelle/Documents/conferences/user2024/index.qmd
    #>     protect_curly: function () 
    #>     protect_math: function () 
    #>     protect_unescaped: function () 
    #>     reset: function () 
    #>     show: function (lines = TRUE, stylesheet_path = stylesheet()) 
    #>     tail: function (n = 6L, stylesheet_path = stylesheet()) 
    #>     write: function (path = NULL, stylesheet_path = stylesheet()) 
    #>     yaml: --- format:   revealjs:       highlight-style: a11y      ...
    #>   Private:
    #>     encoding: UTF-8
    #>     md_lines: function (path = NULL, stylesheet = NULL) 
    #>     sourcepos: FALSE
    

    Then I extract all links.

    links <- xml2::xml_find_all(
      talk_yarn$body, 
      xpath = ".//md:link",
      ns = talk_yarn$ns
    )
    head(links)
    #> {xml_nodeset (6)}
    #> [1] <link destination="https://user-maelle.netlify.app" title="">\n  <text xm ...
    #> [2] <link destination="https://www.pexels.com/photo/old-cargo-ship-on-sea-207 ...
    #> [3] <link destination="https://www.pexels.com/photo/the-word-louise-is-spelle ...
    #> [4] <link destination="https://www.pexels.com/photo/gray-rotary-telephone-on- ...
    #> [5] <link destination="https://www.pexels.com/photo/close-up-photography-of-y ...
    #> [6] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
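
To peek at just the URLs before any filtering (a quick check, not in the original post), the destination attribute can be pulled straight from the node set:

head(xml2::xml_attr(links, "destination"))
# a character vector of the six URLs shown above,
# starting with "https://user-maelle.netlify.app"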
    

I then throw away the links to the great website Pexels, because these are image credits rather than information useful for doing R stuff.

    links <- purrr::discard(
      links, 
      \(x) startsWith(xml2::xml_attr(x, "destination"), "https://www.pexels")
    )
    head(links)
    #> {xml_nodeset (6)}
    #> [1] <link destination="https://user-maelle.netlify.app" title="">\n  <text xm ...
    #> [2] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
    #> [3] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
    #> [4] <link destination="https://www.heltweg.org/posts/who-wrote-this-shit/" ti ...
    #> [5] <link destination="https://fosstodon.org/@hadleywickham/11202130903588421 ...
    #> [6] <link destination="https://nostarch.com/kill-it-fire" title="">\n  <text  ...
    

After that I can format the links and display them here using an “asis” chunk (a sketch of such a chunk appears after the list of links below). Yes, my slidedeck uses Quarto, but this blog is still powered by R Markdown, hugodown to be precise.

    I’m using the formatting as an opportunity to only keep distinct links: sometimes I had very similar slides in a row, with repeated information.

    format_link <- function(link) {
      url <- xml2::xml_attr(link, "destination")
      text <- xml2::xml_text(link)
      sprintf("* [%s](%s)", text, url)
    }
    
    formatted_links <- purrr::map_chr(links, format_link)
    
    formatted_links <- unique(formatted_links)
    
    formatted_links |>
      paste(collapse = "\n") |>
      cat()
    
    • https://user-maelle.netlify.app
    • R Consortium ISC
    • https://www.heltweg.org/posts/who-wrote-this-shit/
    • https://fosstodon.org/@hadleywickham/112021309035884210
    • https://nostarch.com/kill-it-fire
    • “Refactoring Pro-Tip: Easiest Nearest Owwie First”
    • https://styler.r-lib.org/
    • https://masalmon.eu/2024/05/23/refactoring-tests/
    • {lintr} itself
    • reference index
    • continuous integration
    • https://masalmon.eu/2024/05/15/refactoring-xml/
    • Tidyteam code review principles
    • The Code Review Anxiety Workbook
    • General science lifecycle
    • Statistical software
    • now
    • then
    • Happy Git and GitHub for the useR
    • “Oh shit, Git!"
    • “How Git works”
    • Why you need small, informative Git commits
    • The two phases of commits in a Git branch
    • Hack your way to a good Git history
    • {saperlipopette}
    • Oh shit, Git!
    • No Maintenance Intended
    • What Does It Mean to Maintain a Package?
    • Three currencies of payment for our work
    • Package maintainer cheatsheet
    • dev guide
    • blog
    • Package Development Corner
    • paths of participation
    • Monthly newsletter
    • Blog
    • R-universe
    • https://ropensci.org/help-wanted
    • https://ropensci.org/news
    • https://devguide.ropensci.org/maintenance_evolution.html#archivalguidance
    • 2021 community call
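
For completeness, here is a sketch of what such an “asis” chunk could look like in the post’s source (the chunk label is made up). With results="asis", knitr inserts the cat() output into the page as Markdown, which is what renders the bullet list above:

```{r formatted-links, results="asis"}
formatted_links |>
  paste(collapse = "\n") |>
  cat()
```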

    Conclusion

Using tinkr, XPath and sprintf(), I was able to create a list of all the links shared in my useR! slidedeck. Some of them have no link text, meaning the URL itself is displayed; some have text that only makes sense in the context of the paragraph they were part of; others are slightly more informative; but at least none of them is a “click here” link. 😅


