    Extracting all links from my slidedeck

Maëlle's R blog on Maëlle Salmon's personal website, published 2024-07-16

Last week after my useR! talk, someone I had met at the R-Ladies dinner asked me for a list of all the links in my slides. I said I’d prepare it, not because I’m a nice person, but because I knew it’d be a use case where the great tinkr package would shine! 😈

    What is tinkr?

tinkr is an R package I created, and which its current maintainer Zhian Kamvar has taken much further than I ever would have. tinkr can transform Markdown into XML and back.

    Under the hood, tinkr uses

• commonmark for the Markdown-to-XML conversion. CommonMark, through its cmark implementation, is the C library that GitHub, for instance, uses to display your Markdown comments as HTML. The commonmark package is also what powers Markdown support in roxygen2. (See the small sketch after this list.)
• xslt for the XML-to-Markdown conversion. XSLT is a templating language for XML.
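
To get a feel for the first of these conversions, here is a minimal sketch using commonmark directly (the input string is made up, and this is not part of the actual workflow below):

commonmark::markdown_xml("A [link](https://example.org) in *Markdown*.")
# returns a single character string of CommonMark XML,
# in which the link becomes a <link destination="https://example.org"> node.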

    Anyway, enough said, let’s go back to today’s use case.

    Extract and format links from index.qmd

    With tinkr I can use XPath, the query language for XML or HTML, to extract links from my slidedeck source. Then I will format them as a list.
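
As a quick illustration of what such an XPath query does, here is a made-up XML snippet (not from my slides) queried with xml2. Note that in the slidedeck’s XML the nodes live in the CommonMark namespace, which is why the real query further down uses the md: prefix and an ns argument:

toy <- xml2::read_xml(
  '<doc><link destination="https://example.org">a link</link><text>no link here</text></doc>'
)
xml2::xml_find_all(toy, ".//link")
#> {xml_nodeset (1)}
#> [1] <link destination="https://example.org">a link</link>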

    First, I create a yarn object from my slidedeck source.

    talk_yarn <- tinkr::yarn$new("/home/maelle/Documents/conferences/user2024/index.qmd")
    talk_yarn
    #> <yarn>
    #>   Public:
    #>     add_md: function (md, where = 0L) 
    #>     body: xml_document, xml_node
    #>     clone: function (deep = FALSE) 
    #>     get_protected: function (type = NULL) 
    #>     head: function (n = 6L, stylesheet_path = stylesheet()) 
    #>     initialize: function (path = NULL, encoding = "UTF-8", sourcepos = FALSE, 
    #>     md_vec: function (xpath = NULL, stylesheet_path = stylesheet()) 
    #>     ns: http://commonmark.org/xml/1.0
    #>     path: /home/maelle/Documents/conferences/user2024/index.qmd
    #>     protect_curly: function () 
    #>     protect_math: function () 
    #>     protect_unescaped: function () 
    #>     reset: function () 
    #>     show: function (lines = TRUE, stylesheet_path = stylesheet()) 
    #>     tail: function (n = 6L, stylesheet_path = stylesheet()) 
    #>     write: function (path = NULL, stylesheet_path = stylesheet()) 
    #>     yaml: --- format:   revealjs:       highlight-style: a11y      ...
    #>   Private:
    #>     encoding: UTF-8
    #>     md_lines: function (path = NULL, stylesheet = NULL) 
    #>     sourcepos: FALSE
    

    Then I extract all links.

    links <- xml2::xml_find_all(
      talk_yarn$body, 
      xpath = ".//md:link",
      ns = talk_yarn$ns
    )
    head(links)
    #> {xml_nodeset (6)}
    #> [1] <link destination="https://user-maelle.netlify.app" title="">\n  <text xm ...
    #> [2] <link destination="https://www.pexels.com/photo/old-cargo-ship-on-sea-207 ...
    #> [3] <link destination="https://www.pexels.com/photo/the-word-louise-is-spelle ...
    #> [4] <link destination="https://www.pexels.com/photo/gray-rotary-telephone-on- ...
    #> [5] <link destination="https://www.pexels.com/photo/close-up-photography-of-y ...
    #> [6] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
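
To peek at just the URLs before any filtering (a quick check, not in the original post), the destination attribute can be pulled straight from the node set:

head(xml2::xml_attr(links, "destination"))
# a character vector of the six URLs shown above,
# starting with "https://user-maelle.netlify.app"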
    

I then throw away the links to the great website Pexels, because these are image credits rather than information useful for doing R stuff.

    links <- purrr::discard(
      links, 
      \(x) startsWith(xml2::xml_attr(x, "destination"), "https://www.pexels")
    )
    head(links)
    #> {xml_nodeset (6)}
    #> [1] <link destination="https://user-maelle.netlify.app" title="">\n  <text xm ...
    #> [2] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
    #> [3] <link destination="https://www.r-consortium.org/all-projects/call-for-pro ...
    #> [4] <link destination="https://www.heltweg.org/posts/who-wrote-this-shit/" ti ...
    #> [5] <link destination="https://fosstodon.org/@hadleywickham/11202130903588421 ...
    #> [6] <link destination="https://nostarch.com/kill-it-fire" title="">\n  <text  ...
    

After that I can format the links and display them here using an “asis” chunk (a sketch of such a chunk appears after the list of links below). Yes, my slidedeck uses Quarto, but this blog is still powered by R Markdown, hugodown to be precise.

    I’m using the formatting as an opportunity to only keep distinct links: sometimes I had very similar slides in a row, with repeated information.

    format_link <- function(link) {
      url <- xml2::xml_attr(link, "destination")
      text <- xml2::xml_text(link)
      sprintf("* [%s](%s)", text, url)
    }
    
    formatted_links <- purrr::map_chr(links, format_link)
    
    formatted_links <- unique(formatted_links)
    
    formatted_links |>
      paste(collapse = "\n") |>
      cat()
    
    • https://user-maelle.netlify.app
    • R Consortium ISC
    • https://www.heltweg.org/posts/who-wrote-this-shit/
    • https://fosstodon.org/@hadleywickham/112021309035884210
    • https://nostarch.com/kill-it-fire
    • “Refactoring Pro-Tip: Easiest Nearest Owwie First”
    • https://styler.r-lib.org/
    • https://masalmon.eu/2024/05/23/refactoring-tests/
    • {lintr} itself
    • reference index
    • continuous integration
    • https://masalmon.eu/2024/05/15/refactoring-xml/
    • Tidyteam code review principles
    • The Code Review Anxiety Workbook
    • General science lifecycle
    • Statistical software
    • now
    • then
    • Happy Git and GitHub for the useR
    • “Oh shit, Git!"
    • “How Git works”
    • Why you need small, informative Git commits
    • The two phases of commits in a Git branch
    • Hack your way to a good Git history
    • {saperlipopette}
    • Oh shit, Git!
    • No Maintenance Intended
    • What Does It Mean to Maintain a Package?
    • Three currencies of payment for our work
    • Package maintainer cheatsheet
    • dev guide
    • blog
    • Package Development Corner
    • paths of participation
    • Monthly newsletter
    • Blog
    • R-universe
    • https://ropensci.org/help-wanted
    • https://ropensci.org/news
    • https://devguide.ropensci.org/maintenance_evolution.html#archivalguidance
    • 2021 community call
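
For completeness, here is a sketch of what such an “asis” chunk could look like in the post’s source (the chunk label is made up). With results="asis", knitr inserts the cat() output into the page as Markdown, which is what renders the bullet list above:

```{r formatted-links, results="asis"}
formatted_links |>
  paste(collapse = "\n") |>
  cat()
```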

    Conclusion

Using tinkr, XPath and sprintf(), I was able to create a list of all the links shared in my useR! slidedeck. Some of them have no link text, meaning the URL itself is displayed; some have text that only makes sense in the context of the paragraph they were part of; others are slightly more informative; but at least none of them is a “click here” link. 😅


