ZoTFIDF: Adding custom tf-idf tags to your Zotero database

I wanted to add my own custom tags to my Zotero citation management database based on the term frequency-inverse document frequency (tf-idf) formula. While not perfect, this is a nice way to pick out what is distinct and important about individual texts.
How did I do this?
It turns out that Zotero stores all the indexed text from attachments (e.g., .pdf files) in a file named .zotero-ft-cache inside each individual item’s folder.
Amazingly, the R package readtext allows you to import text from plaintext files that are organized in this way using:
library(readtext)
# Read every cached full-text file under the Zotero storage directory
textDF <- readtext(paste0("~/Zotero/storage", "/*/.zotero-ft-cache"),
                   docvarsfrom = "filepaths")
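Each cache file sits in a folder named after its attachment, so the folder name is what later ties the text back to a database record. Here is a minimal base-R sketch of pulling that folder name out of a path. The two paths and keys are made up for illustration, and I am assuming the folder name corresponds to the `key` column of Zotero's `items` table:

```r
# Hypothetical cache paths; real ones would come from the imported text data.
paths <- c(
  "~/Zotero/storage/ABCD2345/.zotero-ft-cache",
  "~/Zotero/storage/WXYZ6789/.zotero-ft-cache"
)

# The name of the parent directory is the attachment key.
item_keys <- basename(dirname(paths))

# Keep path and key side by side so the key can be used in later joins.
key_df <- data.frame(path = paths, key = item_keys, stringsAsFactors = FALSE)
print(key_df)
```

With a key column like this, the imported text can be joined to whatever is read from the database.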
Once I had done this, I was able to use the DBI and RSQLite packages to interact directly with the Zotero database, zotero.sqlite.
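To give a feel for what talking to the database looks like, here is a rough sketch using DBI and RSQLite. It does not touch a real Zotero database: it builds a toy in-memory database with heavily simplified versions of the `items`, `tags`, and `itemTags` tables (the rows are invented), then left-joins them. For real use it seems safest to point `dbConnect()` at a copy of zotero.sqlite with Zotero closed.

```r
library(DBI)

# Toy in-memory database standing in for a COPY of zotero.sqlite.
# The tables below are simplified assumptions, not the full Zotero schema.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "items",
             data.frame(itemID = c(1L, 2L), key = c("ABCD2345", "WXYZ6789")))
dbWriteTable(con, "tags",
             data.frame(tagID = c(10L, 11L), name = c("markets", "growth")))
dbWriteTable(con, "itemTags",
             data.frame(itemID = c(1L, 1L), tagID = c(10L, 11L), type = c(1L, 1L)))

# Left joins keep items that have no tags yet (their tag name comes back as NA).
item_tags <- dbGetQuery(con, "
  SELECT i.itemID, i.key, t.name
  FROM items AS i
  LEFT JOIN itemTags AS it ON it.itemID = i.itemID
  LEFT JOIN tags AS t ON t.tagID = it.tagID
  ORDER BY i.itemID, t.name
")
print(item_tags)

dbDisconnect(con)
```

The same pattern (read a table, join on `itemID` or `tagID`) extends to the attachment and title tables.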
With a few tricks from the quanteda package and a few left joins, I was able to tag every existing item that has indexed text with its 5 highest-scoring tf-idf features!
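For anyone curious about the scoring itself, here is a small base-R sketch of picking the top tf-idf terms per document. It is a stand-in for illustration, not the quanteda code used for this post: the three documents are invented, tokenization is a plain whitespace split, and the weighting (raw term count times log10 of inverse document frequency) is, as far as I know, what quanteda's `dfm_tfidf()` does by default.

```r
# Toy documents standing in for the cached full text (made-up content).
docs <- c(
  a = "luck and markets reward luck",
  b = "markets and mobility and growth",
  c = "mobility and opportunity and opportunity"
)

# Tokenize on whitespace and build a document-by-term count matrix.
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)
vocab <- sort(unique(unlist(tokens)))
tf <- t(vapply(tokens,
               function(tok) as.numeric(table(factor(tok, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# tf-idf = term count * log10(N / number of documents containing the term)
doc_freq <- colSums(tf > 0)
tfidf <- sweep(tf, 2, log10(length(docs) / doc_freq), `*`)

# Keep up to n highest-scoring terms per document as candidate tags.
top_terms <- function(scores, n = 5) {
  scores <- scores[scores > 0]
  names(sort(scores, decreasing = TRUE))[seq_len(min(n, length(scores)))]
}
new_tags <- lapply(rownames(tfidf), function(d) top_terms(tfidf[d, ]))
names(new_tags) <- rownames(tfidf)
print(new_tags)
```

A word like "and" that appears in every document gets a score of zero and never becomes a tag, which is the whole point of the inverse document frequency term.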
Now I’m slowly adding those tags to justdeserts.org. Success!
## # A tibble: 507 x 13
## # Groups:   name [435]
##    itemID  type name  tagID parentItemID linkMode contentType charsetID path
##     <int> <dbl> <chr> <dbl>        <int>    <int> <chr>           <int> <chr>
##  1    165     1 Prog… 25131          165        2 applicatio…        NA "C:\…
##  2    241     1 Index 25158          241        2 applicatio…        NA "C:\…
##  3    241     1 maln… 20268          241        2 applicatio…        NA "C:\…
##  4    243     1 CONC… 18803          243        2 applicatio…        NA "C:\…
##  5    468     1 Dean  25315          468        1 text/html           1 "sto…
##  6    551     1 Hoch… 25513          551        1 text/html           1 "sto…
##  7   1448     1 11-10 25271         1448        0 applicatio…        NA "sto…
##  8   1503     1 gmail 25342         1503        1 text/html           1 "sto…
##  9   1562     1 hours 17277         1562        2 applicatio…        NA "/ho…
## 10   1585     1 supra 17482         1585        2 applicatio…        NA "/ho…
## # … with 497 more rows, and 4 more variables: syncState <int>,
## #   storageModTime <int64>, storageHash <chr>, itemIDOriginal <int>

## # A tibble: 25 x 5
## # Groups:   name [25]
##    itemID name        collectionID fieldID value
##     <int> <chr>              <int>   <int> <chr>
##  1  19593 markets            16441     110 Success and Luck: Good Fortune and t…
##  2  19621 poisoning          16441     110 The case against equality of opportu…
##  3  19637 Mobility           16441     110 The Other American Dream: Social Mob…
##  4  19642 2013-06            16441     110 U.S. Economic Mobility: The Dream an…
##  5  19642 born               16441     110 U.S. Economic Mobility: The Dream an…
##  6  19764 growth             16441     110 The Threat of Inequality of Opportun…
##  7  19766 jech-2019-…        16441     110 Equality of opportunity is linked to…
##  8  19770 WordPress.…        16441     110 Poverty and equality of opportunity:…
##  9  19786 Ancient            16441     110 Virtue and Happiness: Essays in Hono…
## 10  19788 Agyei              16441     110 Foreign Minister Demands Equal Oppor…
## # … with 15 more rows
Thanks to @adamsmith for pointing me to Zotero's documentation on direct SQLite database access (https://www.zotero.org/support/dev/client_coding/direct_sqlite_database_access) in this forum comment: https://forums.zotero.org/discussion/75571/exporting-metadata-and-cached-full-text-in-some-format