I wanted to add my own custom tags to my Zotero citation management database based on the term frequency-inverse document frequency (tf-idf) formula. While not perfect, tf-idf is a good way to pick out what is distinctive and important about individual texts.
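For reference, one common formulation of tf-idf is sketched below. The post does not say which weighting variant was used, so the symbols here are assumptions for illustration: $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

```latex
$$
\operatorname{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
$$
```

A term scores highly when it is frequent in one document but rare across the collection, which is why the top-scoring terms make plausible tags.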

One caveat first. Zotero's documentation warns against writing to the database directly:

“external access to the SQLite database (including direct access via the mozStorage API) should be done only in a read-only manner. Modifying the database while Zotero is running can easily result in a corrupted database, as mozStorage caching breaks the normal file-locking in SQLite that allows for safe concurrent file access. And even if Firefox is shut down before accessing the file, modifying the database directly bypasses the data validation and referential integrity checks performed by Zotero and the Zotero server that are required for Zotero to function properly. Generally, the SQLite database should be viewed as an internal database that has the benefit of being externally readable for people who want to get the data out in other ways.”1
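In the spirit of that warning, here is a minimal sketch of reading the database without touching the live file: copy it to a temporary location and open the copy read-only. The helper name `open_snapshot_readonly` is hypothetical, and the location `~/Zotero/zotero.sqlite` is an assumption based on the storage path used in this post. This is not necessarily how I connected.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper: copy the database, then open the copy read-only.
open_snapshot_readonly <- function(db_path) {
  stopifnot(file.exists(db_path))
  snapshot <- tempfile(fileext = ".sqlite")
  file.copy(db_path, snapshot)
  # SQLITE_RO makes any write attempt on this connection fail.
  dbConnect(RSQLite::SQLite(), snapshot, flags = RSQLite::SQLITE_RO)
}

# Assumed default location; adjust to your own Zotero data directory.
db_path <- path.expand("~/Zotero/zotero.sqlite")
if (file.exists(db_path)) {
  con <- open_snapshot_readonly(db_path)
  print(head(dbListTables(con)))
  dbDisconnect(con)
}
```

Working on a copy means that a mistake in a query cannot corrupt the library Zotero is using.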

How did I do this? It turns out that Zotero stores all the indexed text from attachments (e.g., .pdf files) in a file named .zotero-ft-cache inside each individual item’s folder. Conveniently, the R package readtext can import plain-text files organized this way:

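library(readtext)
# Read every attachment's full-text cache; docvarsfrom = "filepaths" derives document variables from each file path.
textDF <- readtext("~/Zotero/storage/*/.zotero-ft-cache",
                   docvarsfrom = "filepaths")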
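To join the imported text back to items in the database, each text needs an identifier. A minimal sketch, assuming that each folder under `storage` is named after the attachment's item key (the `key` column of the `items` table); the two paths below are made-up examples, not real entries from my library:

```r
# Made-up example paths in the layout described above.
cache_paths <- c(
  "~/Zotero/storage/ABCD2345/.zotero-ft-cache",
  "~/Zotero/storage/WXYZ6789/.zotero-ft-cache"
)

# The parent folder name serves as the attachment key.
attachment_keys <- basename(dirname(cache_paths))
print(attachment_keys)
```

With the keys in a column, the text can be left-joined to the item IDs read from the database.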

Once I had done this, I used the DBI and RSQLite packages to interact directly with the Zotero database, zotero.sqlite. With a few tricks from the quanteda package and a few left joins, I added new tags to every existing item with indexed text: the five features with the highest tf-idf scores. Now I’m slowly adding those tags to justdeserts.org. Success!
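The tf-idf step can be sketched with quanteda roughly as follows. This is an illustration on three toy documents, not the exact code behind my tags: the texts, the stopword removal, and quanteda's default `dfm_tfidf()` weighting are all assumptions.

```r
library(quanteda)

# Toy stand-ins for the indexed text of three items.
texts <- c(
  a = "markets reward luck and markets reward effort",
  b = "social mobility and the american dream",
  c = "equality of opportunity and social mobility"
)

# Tokenize, drop punctuation and English stopwords, then weight by tf-idf.
toks <- tokens(texts, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
tfidf <- dfm_tfidf(dfm(toks))

# For each document, keep up to five features with the highest positive score.
m <- as.matrix(tfidf)
top5 <- lapply(rownames(m), function(doc) {
  scores <- m[doc, ]
  scores <- scores[scores > 0]
  head(names(sort(scores, decreasing = TRUE)), 5)
})
names(top5) <- rownames(m)
print(top5)
```

Each element of `top5` is a candidate set of tags for one document, ready to be joined to item IDs.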

# Code

## # A tibble: 507 x 13
## # Groups:   name [435]
##    itemID  type name  tagID parentItemID linkMode contentType charsetID path
##     <int> <dbl> <chr> <dbl>        <int>    <int> <chr>           <int> <chr>
##  1    165     1 Prog… 25131          165        2 applicatio…        NA "C:\…
##  2    241     1 Index 25158          241        2 applicatio…        NA "C:\…
##  3    241     1 maln… 20268          241        2 applicatio…        NA "C:\…
##  4    243     1 CONC… 18803          243        2 applicatio…        NA "C:\…
##  5    468     1 Dean  25315          468        1 text/html           1 "sto…
##  6    551     1 Hoch… 25513          551        1 text/html           1 "sto…
##  7   1448     1 11-10 25271         1448        0 applicatio…        NA "sto…
##  8   1503     1 gmail 25342         1503        1 text/html           1 "sto…
##  9   1562     1 hours 17277         1562        2 applicatio…        NA "/ho…
## 10   1585     1 supra 17482         1585        2 applicatio…        NA "/ho…
## # … with 497 more rows, and 4 more variables: syncState <int>,
## #   storageModTime <int64>, storageHash <chr>, itemIDOriginal <int>

## # A tibble: 25 x 5
## # Groups:   name [25]
##    itemID name        collectionID fieldID value
##     <int> <chr>              <int>   <int> <chr>
##  1  19593 markets            16441     110 Success and Luck: Good Fortune and t…
##  2  19621 poisoning          16441     110 The case against equality of opportu…
##  3  19637 Mobility           16441     110 The Other American Dream: Social Mob…
##  4  19642 2013-06            16441     110 U.S. Economic Mobility: The Dream an…
##  5  19642 born               16441     110 U.S. Economic Mobility: The Dream an…
##  6  19764 growth             16441     110 The Threat of Inequality of Opportun…
##  7  19766 jech-2019-…        16441     110 Equality of opportunity is linked to…
##  8  19770 WordPress.…        16441     110 Poverty and equality of opportunity:…
##  9  19786 Ancient            16441     110 Virtue and Happiness: Essays in Hono…
## 10  19788 Agyei              16441     110 Foreign Minister Demands Equal Oppor…
## # … with 15 more rows
