Zotero my way
Zotero is ready to take off, rolling over Endnote on the runway.
[all the praise goes here]
But - you know it’s coming - here are some of the things I’d like to change. Sorry, no time for links now.
Automatic PDF fetching
It looks like the screen scrapers try to get the PDFs. This works well with Science Magazine, but it fails on some other sites, grabbing blank HTML pages instead.
PDF organization
But more importantly, I WANT MY PDF FILES IN A FOLDER THAT IS NOT OBSCURE, GOOGLE-DESKTOP SEARCHABLE, AND HUMAN READABLE. Please, oh, please.
The way it goes, Zotero automatically creates a "storage" folder under the Firefox profile, which includes a random string as part of the path name. Then, in an understandable but regrettable twist, it creates a folder for each biblio entry that has either attachments or snapshots, using the ENTRY ID (integer) as the folder name. And the PDF is not renamed.
What this means is that you can pretty much only access these PDFs through Zotero. No, that’s not how I want it. I want my PDFs to be all in one place that I can see and browse; and I may want to share them, too.
Here’s how I want it:
- Change function getStorageDirectory() in XCOMP/zotero.js so that it returns a user-defined directory (which has to be readable and writable by Zotero). The user should be able to add/change the preference. This would become the base URI/URL for the PDF.
- In attachment.js, change:
// Create directory for attachment files within storage directory
var destDir = Zotero.getStorageDirectory();
destDir.append(itemID);
destDir.create(Components.interfaces.nsIFile.DIRECTORY_TYPE, 0644);
so that no new directory is created. The URI/URL to the file needs to be coded differently.
- Rename the PDF file according to a user preference: e.g., I like
$AUTH_LAST ($YEAR) - $TITLE_BRIEF
Others may want something else, but no one likes "fulltext.pdf."
- And in the database, store the URI/URL to the file. Copy the PDF to the URI. Note: leave open which protocol to use. It may be within the file system, or over HTTP, FTP, P2P, etc. Leave room for upgrade.
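The renaming step could look roughly like the sketch below. It is only an illustration: buildAttachmentName and the field names are made up for this example and are not part of Zotero's code, and the template syntax is assumed to be the $NAME style shown above.

```javascript
// Hypothetical helper: none of these names come from Zotero's code base.
// Expands a user-defined template such as "$AUTH_LAST ($YEAR) - $TITLE_BRIEF"
// into a file name that is safe on common file systems.
function buildAttachmentName(template, fields, extension) {
  // Replace each $NAME placeholder with the matching field, or "" if missing.
  var name = template.replace(/\$([A-Z_]+)/g, function (match, key) {
    return Object.prototype.hasOwnProperty.call(fields, key)
      ? String(fields[key])
      : "";
  });
  // Strip characters that Windows and most other file systems reject.
  name = name.replace(/[\\\/:*?"<>|]/g, "");
  // Collapse runs of whitespace and trim the ends.
  name = name.replace(/\s+/g, " ").trim();
  return name + "." + extension;
}

var fileName = buildAttachmentName(
  "$AUTH_LAST ($YEAR) - $TITLE_BRIEF",
  { AUTH_LAST: "Smith", YEAR: 2007, TITLE_BRIEF: "Zotero: my way" },
  "pdf"
);
console.log(fileName);
```

The resulting name would then be appended to the user-defined base directory, and the full URI/URL stored in the database instead of a path built from the entry ID.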
HTML snapshots
Make it a single file, not a bunch of loose JPG, CSS, HTM files. I think this is pretty much the only reason to create subdirectories for each attachment. I have struggled with archiving HTML snapshots since 2000, when I started blogging. The best solution I had was the MHTML format, which was an option in IE for some time. It’s a single file with HTML and simple MIME encoding of any relevant files rolled together. It’s basically the native format of HTML email before it’s unpacked by your mail client. There were rumors about security flaws, but I don’t care. I don’t like thousands of sub-directories and tens of thousands of loose files on my hard drive.
If MHTML is not an option, the most reasonable thing to do is to zip it, rar it, tar it, or jar it. Just name the archive the same way you name the PDF file (see above). I assume Javascript or Mozilla can handle some compressed formats natively.
Single files, please.
Integration with the writing process.
The Word plugin is promising, but if I were to do it, I’d try something else. I have said many times elsewhere that I hate Endnote, in part because its Word plugin messes up the file. After upgrading this and that, or sending Word docs back and forth, I have had a number of irrecoverable problems with the citation tags.
The problem is that the Endnote/Zotero plugin inserts a tag that basically codes the reference #s of the citations, i.e., the unique record ID in the E/Z database. Come on, we all know Word files move between computers. You simply cannot trust that a DOC will only be edited on a single PC with the same E/Z database.
Here’s how I’d design the plugin, which in principle should work in both Word and web-based writers:
- At the time of citation, inject data snippets in some micro-format, XML or otherwise. This would include everything you need for making the reference.
- use the plug-in to format the in-line citations (also take care of multiple sources, etc.)
- use the plug-in to generate a reference list at the designated place; take care of duplicated citations, reference sorting, etc.
This way, all the citation/reference info is embedded and is with the text, all the time. Sure, one can embed the record ID so as to synchronize with E/Z, but that’s a secondary concern. And the plug-in may call Endnote/Zotero functions in formatting references and/or citations. But always embed the data.
This is not unlike the recent (since I last played with mediawiki a year ago) citation extension on Wikipedia. The difference is that they simply insert a "^" mark and then add the ref later using server-side scripts. Let’s do it on the client side, because Zotero is on the client, not a server (yet). And Wikipedia does not distinguish fields. A well-designed microformat (CSL) would remedy the problem.
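To make the idea concrete, here is a minimal sketch of what a plug-in could do once every citation carries its own data. The record shape (id, authorLast, year, title, source) is an assumption made for this example, not an existing microformat, and the output format is only loosely APA-like.

```javascript
// Assumption: each citation in the document carries a complete record like
// these. The field names and ids are illustrative only.
var embedded = [
  { id: "ref-smith-2007", authorLast: "Smith", year: 2007, title: "Second paper", source: "Journal B" },
  { id: "ref-jones-2005", authorLast: "Jones", year: 2005, title: "First paper", source: "Journal A" },
  { id: "ref-smith-2007", authorLast: "Smith", year: 2007, title: "Second paper", source: "Journal B" }
];

// Format one parenthetical citation covering one or more sources.
function formatInline(records) {
  return "(" + records.map(function (r) {
    return r.authorLast + ", " + r.year;
  }).join("; ") + ")";
}

// Build the reference list from the embedded data alone:
// drop duplicated citations, sort by author then year, format each entry.
function buildReferenceList(records) {
  var seen = Object.create(null);
  var unique = records.filter(function (r) {
    if (seen[r.id]) return false;
    seen[r.id] = true;
    return true;
  });
  unique.sort(function (a, b) {
    return a.authorLast.localeCompare(b.authorLast) || a.year - b.year;
  });
  return unique.map(function (r) {
    return r.authorLast + " (" + r.year + "). " + r.title + ". " + r.source + ".";
  });
}

var inline = formatInline([embedded[1], embedded[0]]);
var references = buildReferenceList(embedded);
console.log(inline);
console.log(references.join("\n"));
```

Nothing here needs the E/Z database: the document itself is the data source, and a record ID would only be one more optional field.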
June 6th, 2007 at 5:08 pm
We had a long discussion about this on the zotero-dev list recently, and it’s something that will be fixed. I actually think the primary priority needs to be good global identifiers, with embedded data as a convenience and fallback measure.
BTW, Word 2007 always embeds the data; in fact depends on it (because they use local ids).
June 6th, 2007 at 5:54 pm
Re. PDFs:
They can already be indexed by your desktop search client.
Further, you can choose to store your Zotero data in some directory other than your Firefox profile (which solves your first request). There are good arguments for at least being able to retain the original filename (which can sometimes contain info). However, doing so also requires subfolders (so that no un-renamed PDF has the same name). Even if they permitted renames, there’d need to be a way to make a unique link. refbase, for example, does allow uploaded PDFs to be renamed to user preferences. But users are strongly encouraged to use the record ID as part of the folder of file name.
Also: How is “title brief” to be auto-generated?
Re HTML snapshots:
Again, altering the page as little as possible does have merits.
MHTML is not workable–MHTML generation isn’t yet in Firefox & various extensions write MHTML that can’t be read by other browsers (many still can’t read MHTML, by the way). And what if you are grabbing a page to actually reuse content on it (to borrow images for a presentation)?
It is true that Mozilla has archive support, so your alternative request is possible. But I don’t see how it is desirable (particularly if you use computationally intensive compression)–fewer tools will be able to work with it directly–you’ll have to decompress it. Disk space is cheap these days, right?
Re Word plugin:
Yes! Discussions of how to give good IDs to citations have been ongoing in the developer google group.
June 6th, 2007 at 6:01 pm
“record ID as part of the folder of file name.”
sorry that should be folder OR file name.
June 7th, 2007 at 10:26 pm
Bruce — sounds like I should subscribe to zotero-dev. Glad people have thought about these issues.
I can see why a global id is appealing from a design point of view (and isn’t DOI created for that purpose?). I have no problem with that. But from a user perspective I’d strongly advocate for embedding biblio data. I can’t expect all my colleagues to have Zotero installed and/or to have the same entries as I have.
Ideally the plug in should work with or without Zotero. Here’s how I’d like it to work:
– you type the citation as you normally do, e.g., (Smith, 2007).
– select “Smith, 2007”, click the “insert citation” button. If Zotero is on, it opens it. If it can’t find Zotero (or god forbid, Endnote), it opens a window where you can enter or copy/paste in the bib info. A simple text window; no fields, etc.
– if you paste in an APA format reference, which I often copy from the reference sections of other papers, it will parse it.
– better yet, if you paste in/drag in CSL, MARC, or whatever standard format, it will parse it for you.
– so now you have the info embedded.
– next time you click “cite” again, it will search existing refs and detect repeated citations, which sometimes need a different treatment, according to APA.
– when you are ready, click the “generate ref list” button; the plug in scans for the embedded data, sorts and formats it, and creates the reference list.
– If you send a paper to a colleague who has Zotero, she will get a chance to import all the embedded refs into her own lib. Of course a global id such as DOI can definitely be part of the embedded data.
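The "type first, link later" part of this workflow could start with something as simple as the sketch below. It assumes citations are typed exactly as (Author, Year) with a single surname; findTypedCitations is a made-up name, and real text would need a far more forgiving pattern.

```javascript
// Hypothetical scanner: finds citations typed as "(Smith, 2007)" and flags
// the ones that repeat an earlier citation in the same text.
function findTypedCitations(text) {
  // One capitalized surname, a comma, a four-digit year, in parentheses.
  var pattern = /\(([A-Z][A-Za-z'-]+), (\d{4})\)/g;
  var counts = Object.create(null);
  var found = [];
  var match;
  while ((match = pattern.exec(text)) !== null) {
    var key = match[1] + ", " + match[2];
    counts[key] = (counts[key] || 0) + 1;
    found.push({
      key: key,                  // e.g. "Smith, 2007"
      index: match.index,        // where the citation starts in the text
      repeated: counts[key] > 1  // true from the second occurrence on
    });
  }
  return found;
}

var draft = "Early work (Smith, 2007) was extended (Jones, 2005) and revisited (Smith, 2007).";
var hits = findTypedCitations(draft);
hits.forEach(function (h) {
  console.log(h.key + (h.repeated ? " [repeated]" : ""));
});
```

Each hit would then be matched against the embedded data, or against Zotero if it is available, and only the unmatched ones would prompt for bib info.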
Call me picky, but one thing I hate about the Endnote plugin is that I can’t type the citations as I write. It forces me to stop in the middle of a sentence, open its ugly database and search for the entry. No, I want the writing experience to be more like writing a wiki, where you write anyway you like and link stuff later.
Word 2007? I hope I will never have to upgrade Word again. I am betting on Google Docs (my chance is not good so far), or at least I will go with Open Office.
June 7th, 2007 at 11:12 pm
Hi Rick,
thanks for the comments. I should have signed up on the zotero-dev list first. I suspect some of these issues have been discussed (and decided) there. Reading your comments I get some sense of what the merits might be but still have uncertainties.
– google desktop indexing: I rely on Google Desktop for fulltext searches. I tried, but gDesktop doesn’t seem to index the Zotero attachments in their default place, i.e., under the Firefox profile. Does anybody else have a similar experience?
– alternative “storage” folder: I got the SVN and the storage location in the trunk version is hardwired. Will I break anything if I simply change it to something else?
– pdf renaming: I took a quick look at the “attachment.js” source code and it looks like all it needs is a unique and stable URI to the PDF file. So as long as the PDF is not renamed again, the initial renaming should be ok, right? Of course, I may very well have missed something obvious.
– title_brief: truncate the title to say 128 letters long (Windows has some limit on filename length), and sanitize it. Make sure it’s unique. That’s what I have to do manually everyday.
– HTML snapshot: I was not concerned about disk space; rather, the goal was to avoid sub directories.
I prefer a flat folder with thousands of PDFs and/or HTML snapshots, all systematically (re)named. Then I always know where to find my PDFs/snapshots, and I can sort, etc. Subfolders (with non-meaningful names) make it too complicated to find a single file. The MHTML (it’s a pity Firefox still doesn’t support it) and the compression ideas are means to avoid subfolders by making it one file for each snapshot.
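For the title_brief and uniqueness points above, a rough sketch of what I do by hand could look like this. The function names are invented for the example, the 128-character limit is the figure from the list above, and names are compared case-insensitively on the assumption that the folder lives on a Windows file system.

```javascript
// Hypothetical helpers, not Zotero code.

// Sanitize a title and cut it to at most maxLength characters.
function makeTitleBrief(title, maxLength) {
  var clean = title
    .replace(/[\\\/:*?"<>|]/g, "") // characters Windows rejects in file names
    .replace(/\s+/g, " ")
    .trim();
  return clean.slice(0, maxLength).trim();
}

// Return a file name that does not collide with anything in `taken`,
// a Set of lower-cased names already present in the flat folder.
function makeUnique(baseName, extension, taken) {
  var candidate = baseName + "." + extension;
  var n = 2;
  while (taken.has(candidate.toLowerCase())) {
    candidate = baseName + " (" + n + ")." + extension;
    n += 1;
  }
  taken.add(candidate.toLowerCase());
  return candidate;
}

var taken = new Set();
var brief = makeTitleBrief("Zotero: my  way?", 128);
var first = makeUnique("Smith (2007) - " + brief, "pdf", taken);
var second = makeUnique("Smith (2007) - " + brief, "pdf", taken);
console.log(first);
console.log(second);
```

As long as the name is fixed once and never changed again, the stored URI stays stable, which is all the flat folder needs.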
——–
I can see how retaining the original filenames makes programming easier. Zotero is already way better than how Endnote handles PDF attachments, which is to copy them to a strange place and rename them to some random strings. I am glad I didn’t fall into that trap (I have 4 GB of PDF files).
Somebody is gonna hate me for saying this — I don’t want to be locked to a particular program, not even if it’s Open Source. I want my PDFs/snapshots to be still organized and useful long after the program is gone. If one day I uninstalled Zotero, what would I have left?
June 8th, 2007 at 12:36 pm
I don’t use Windows extensively & therefore don’t use GDS. However, it seems that you’re right that it probably doesn’t search the default firefox profile directory (or at least not by default):
http://desktop.google.com/support/bin/answer.py?answer=12634
——–
Depending on the machine, I use either the release version or the dev version. Both can have the storage location modified safely. Gear icon->preferences->advanced. I suppose that moving it out of your Firefox profile won’t only solve the obscure location for you, but will solve the GDS indexing too.
——–
I agree that pdf renaming is technically possible. I just wanted to emphasize that “archivists” might not want renaming & those who wanted “pretty” names still have to prevent collisions. Both are doable.
——–
Thanks for explaining what you meant by title_brief. Simple truncation like that would definitely be possible. I had been imagining some sort of intelligent system to keep keywords of the title in (discarding the least essential words first). That’s the behavior I see by researchers who make an abbreviated title manually & I was wondering how to implement it.
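One crude way to approximate that behavior is sketched below: treat a short stop-word list as the "least essential" words, drop those first, and only then cut whole words from the end. The list and the function name are assumptions for illustration; a real ranking of word importance would need something smarter.

```javascript
// Assumption: a small stop-word list stands in for "least essential words".
var STOP_WORDS = ["a", "an", "the", "of", "and", "or", "in", "on", "for", "to", "with"];

function abbreviateTitle(title, maxLength) {
  var words = title.split(/\s+/).filter(Boolean);
  // First pass: walk backwards and drop stop words, but only while the
  // title is still too long.
  for (var i = words.length - 1; i >= 0 && words.join(" ").length > maxLength; i--) {
    if (STOP_WORDS.indexOf(words[i].toLowerCase()) !== -1) {
      words.splice(i, 1);
    }
  }
  // Second pass: drop trailing words until it fits, keeping at least one.
  while (words.length > 1 && words.join(" ").length > maxLength) {
    words.pop();
  }
  // Last resort for a single over-long word: plain truncation.
  return words.join(" ").slice(0, maxLength);
}

var shortTitle = abbreviateTitle("The Role of Citation Managers in the Writing Process", 40);
console.log(shortTitle);
```

Titles that already fit are returned unchanged, so this could sit in front of simple truncation as an optional step.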
——–
Again, I don’t see how subdirectories are bad for HTML snapshots. If you didn’t mind rewriting content, you could keep the primary HTML in the main folder & the associated files in a single subdirectory with the same name as that primary HTML file (I think most major browsers can do this when you save a webpage as multiple files).
It does look like GDS can index zip files, so ZIP might not be that bad. Can GDS do MHTML?
——–
Do you keep all 4GB of PDFs in a single directory? If we get renaming in Zotero, I hope we also can assign custom folder names (to store PDFs in directories based on year/journal/primary author/particular collection/whatever).
——–
I agree that lockin is bad, but I don’t think Zotero really locks you in. In this (beta) version, there might still be some barriers to migration. However, those barriers seem smaller than MANY other applications. I certainly don’t know of any proprietary reference management software that is easier to migrate away from than Zotero (even though they’ve had considerably more development time and resources & even when they’re popular enough to have third parties write tools to assist in migration). While some open source software would probably allow slightly easier migration, I really think Zotero is making active strides to make it easier to share data with other apps.
June 21st, 2007 at 1:52 pm
Don’t have time to respond to everything here (though I would also encourage you to use zotero-dev and the forums, as almost all of these things have been discussed and/or addressed), but a few notes:
- Re: single file snapshots, archives, etc., see http://forums.zotero.org/discussion/919/ (Mozilla doesn’t have ZIP writing yet, just reading)
- Zotero has had automatic, customizable PDF renaming for a while (also mentioned on that thread), though the fields you can use are currently limited to the defaults. But it supports truncation and sanitization already.
- GDS aside (which, as Rick notes, can be made to work), Zotero has experimental PDF indexing itself. See http://www.zotero.org/documentation/pdf_fulltext_indexing. The next version will have much more user-friendly support.
- Word plugin ideas are interesting. Let’s have them on the dev list.