Difference between revisions of "Bibliographic/OOoBib Functional Requirements/Keywords"

From Apache OpenOffice Wiki

Revision as of 16:50, 29 September 2006

This document has been placed on the wiki so that members of the OpenOffice community can assist in developing the design and documentation for the enhanced bibliographic facility.

Back to OOoBib Functional Requirements

Keywords

One way to better sort articles is based on Keywords (see my post on keywords).

( tell me the title and date and I will insert a link to the message David Wilson )

However, there is another way I will shortly describe here.

There are a number of categories a research paper can belong to:

  • Basic Research
  • Theoretical Research (especially in Math/Physics)
  • Modeling
  • Trials:
    • randomized controlled trial
    • Meta-analysis
    • other trial
  • Review
  • Guideline
  • Correspondence
  • Editorial
  • Epidemiologic Study
  • Case Report
  • Images in clinical medicine (some Journals have such a feature/ could be a subgroup of Case Report)
  • Questions/ Question-Answers

If there are other relevant categories, feel free to implement them as well.

This is especially useful when searching for all trials on a given matter (e.g. for writing a meta-analysis or writing a review or a guideline), or for a specific case report.

I have more than 2500 articles saved on my computer, and searching for the correct file is a nightmare. 2500 articles may seem a huge number; in infectious diseases, however, this is only a minimum to start with.

It is useful to have a field storing this information. Although custom fields exist, this is a feature that should be standard. It allows searching (and grouping) articles on a more powerful basis.
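As a rough illustration of what such a standard field would allow, here is a minimal Python sketch. The `Article` record, the `LIBRARY` sample and the function names are assumptions made for this example only; they are not part of any existing OOoBib data model. The category names are taken from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ArticleType(Enum):
    # Categories copied from the list above.
    BASIC_RESEARCH = "Basic Research"
    THEORETICAL_RESEARCH = "Theoretical Research"
    MODELING = "Modeling"
    RANDOMIZED_CONTROLLED_TRIAL = "randomized controlled trial"
    META_ANALYSIS = "Meta-analysis"
    OTHER_TRIAL = "other trial"
    REVIEW = "Review"
    GUIDELINE = "Guideline"
    CORRESPONDENCE = "Correspondence"
    EDITORIAL = "Editorial"
    EPIDEMIOLOGIC_STUDY = "Epidemiologic Study"
    CASE_REPORT = "Case Report"
    IMAGES = "Images in clinical medicine"
    QUESTIONS = "Questions/ Question-Answers"


# The three sub-items of "Trials" in the list above.
TRIAL_TYPES = frozenset({
    ArticleType.RANDOMIZED_CONTROLLED_TRIAL,
    ArticleType.META_ANALYSIS,
    ArticleType.OTHER_TRIAL,
})


@dataclass
class Article:
    # Hypothetical record: a real entry would carry many more fields.
    title: str
    article_type: ArticleType


def group_by_type(articles):
    """Group article titles by their article type."""
    groups = defaultdict(list)
    for article in articles:
        groups[article.article_type].append(article.title)
    return dict(groups)


def find_trials(articles):
    """Return the titles of all trials, e.g. when preparing a meta-analysis."""
    return [a.title for a in articles if a.article_type in TRIAL_TYPES]


# Invented sample data, for illustration only.
LIBRARY = [
    Article("Drug A versus drug B in endocarditis", ArticleType.RANDOMIZED_CONTROLLED_TRIAL),
    Article("Endocarditis: an overview", ArticleType.REVIEW),
    Article("Pooled trials of drug A", ArticleType.META_ANALYSIS),
]
```

With a dedicated field, "all trials on a given matter" becomes a simple filter (`find_trials(LIBRARY)`) instead of a free-text search.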

Submitted as issue number 66353 by discoleo at Openoffice.org.

Implementation ideas

How should this be implemented? Most bibliographic and document systems I have seen seem to think that adding a field for keywords is enough, and let the users invent their own categories. I have been involved in IT development and document management systems and have had enough lectures from librarians (i.e. professional indexers) to know that this just leads to a big unmanageable mess, which librarians are often called in to try to fix.

Also, a good keyword system has a good set of aliases defined. One insurance company was providing different compensation for fractured limbs than for broken limbs, because their compensation history search system did not have these aliases defined. The cases and the compensation history diverged as each of the staff used their preferred term.

So: should we build pre-defined document category sets, from which a user could select one for each document collection, e.g. Medical Research, Physical Sciences, Social Sciences etc.? David Wilson

Discussion

After thorough thought, I believe more and more that standardization is both highly useful and needed. While I acknowledge that it will be difficult to get a working standardization in the immediate future, this is something that deserves hard work. I hope that it will be implemented somewhere in the more distant future. There is also an interesting comment on the need for standardising tags/metadata on this site: http://netapps.muohio.edu/blogs/darcusb/darcusb/archives/2006/09/09/zotero-and-the-practical-semantic-web .

Until we get a working standardization, it is nevertheless pertinent to implement various other mechanisms needed for a more comprehensive keyword solution. See also the mailing lists for a brainstorming session: http://bibliographic.openoffice.org/servlets/BrowseList?listName=dev&by=date&from=2006-09-01&to=2006-09-30&first=1&count=23 .


I wish to discuss 3 points:

  • limitations of current keywords
  • how to standardize
  • how to implement the standardization


Why Standardise

As more and more research data becomes available, it becomes increasingly difficult to use this data efficiently. The problem stems from the simple fact that you do NOT get what you want. Most of the published data will end up somewhere in the nirvana of computer storage, without ever being read by those who would benefit most from it. This problem is likely to deepen in the near future, as more and more journals appear and huge amounts of data are published.

To illustrate this further, it is helpful to perform some searches: when entering a common term, the search generates such a huge number of hits that it is impossible even to read all the titles. Searching for a funky term might narrow the results, but there are still thousands of hits. I do indeed have serious problems when searching for something. There is so much literature available that I easily get overwhelmed, although most of it is not relevant to the work. Refining a search is becoming increasingly difficult, and the time spent on searching can exceed the time needed to read the actual article.

This fact has been recognized by Pubmed as well, and they have implemented various search strategies to increase the accuracy of the search (see e.g. Clinical Queries, http://www.ncbi.nlm.nih.gov/entrez/query/static/clinical.shtml ). However, this is only a workaround for the actual problem and will become ultimately insufficient, too.


Limitations of Current Keyword Strategies

Before discussing the steps necessary to implement a standardization, it becomes pertinent to point to some limitations of current indexing strategies using keywords.

  • currently, an article may have a number of keywords defined
    • this list is a plain text list
    • this plain structure is one of the reasons for failure of keywords
    • to be extensive (aka sensitive), you must define many keywords
    • and this undoubtedly reduces the specificity, i.e. when performing a search, many articles actually not needed would be retrieved, too; [it would be also very impractical to store such huge keyword lists]
  • to solve this paradox, one needs a hierarchical tree structure:
    • one keyword might imply another term as well
    • entering both terms as keywords will, however, create very large keyword lists and generate the problems mentioned above
    • therefore the need for a hierarchical tree (see later, Hierarchical Tree): one term points automatically to one (or more) trees, containing various further search terms/keywords
    • the magic of this approach is, that we may change later the structure of these trees to adapt them for the particular search needs (see later)
    • these trees wouldn't be defined as a standard, but any user would create his own tree/relation to maximize his search results (both the sensitivity and specificity)


How to standardise

This is a huge task, and I believe there is a reason why there is no standardization to date. Therefore, before starting from scratch, it would be wise to search for work already done:

  • search for standards
  • contact librarians, other groups
  • contact others who might be interested or have done work in this field (e.g. Pubmed; I will try to contact the Pubmed team and hope for an answer)

Some journals already sort their articles based on some specific features (e.g. Circulation - the journal of the American Heart Association; Chest, and others). Therefore, it could be somewhat easier to implement part of the standardisation, because professional societies already use such features. However, other fields are covered less well and could cause some pain.

Probably it is the best thing to ask the professional societies to create such a framework.


How to implement this

In order to be used in practice, the program MUST already suggest some appropriate categories to the end-user. This could be more easily accomplished for the major article category, but for more detailed keywords it will become increasingly difficult. (YES, I believe that all the keywords should be standardised, as pointed out earlier; maybe sometime in the future.)

Specific procedure:

  • scan journal title: journals publish in most instances only articles from a very narrow field (except maybe Nature and Science)
  • scan title and abstract for some standard words (aka the keywords defined for that specific journal category/ research field)
  • depending on the words found, suggest an article category/ subcategory: e.g. medicine/ surgery/ abdominal surgery/ randomised controlled trial; another example: veterinary medicine/ dog / infectious diseases/ rabies/ vaccine
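
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the lookup tables `JOURNAL_FIELDS`, `TOPIC_WORDS` and `TYPE_PHRASES` are tiny, invented placeholders (a real system would take them from the journal list and from the standardized keyword sets), and `suggest_category` is a hypothetical name.

```python
import re

# Step 1 data: journal -> field path (placeholder entries, assumed for the example).
JOURNAL_FIELDS = {
    "J Antimicrob Chemother": ["medicine", "infectious diseases"],
    "Vet Microbiol": ["veterinary medicine", "infectious diseases"],
}

# Step 2 data: standard words expected in title/abstract -> subcategory.
TOPIC_WORDS = {
    "rabies": "rabies",
    "vaccine": "vaccine",
    "vaccination": "vaccine",
    "endocarditis": "endocarditis",
}

# Step 3 data: phrases that hint at the article type.
TYPE_PHRASES = [
    ("randomized controlled trial", "randomized controlled trial"),
    ("randomised controlled trial", "randomized controlled trial"),
    ("meta-analysis", "meta-analysis"),
    ("case report", "case report"),
]


def suggest_category(journal, title, abstract=""):
    """Suggest a category path (a list of nodes) for the user to confirm or edit."""
    # 1. The journal usually fixes the broad field.
    path = list(JOURNAL_FIELDS.get(journal, []))

    # 2. Scan title and abstract for the standard words of that field.
    text = f"{title} {abstract}".lower()
    for word in re.findall(r"[a-z]+", text):
        topic = TOPIC_WORDS.get(word)
        if topic and topic not in path:
            path.append(topic)

    # 3. Append the article type if a known phrase is present.
    for phrase, article_type in TYPE_PHRASES:
        if phrase in text:
            path.append(article_type)
            break
    return path


suggestion = suggest_category(
    "Vet Microbiol",
    "Efficacy of a rabies vaccine in dogs: a randomised controlled trial",
)
print("/ ".join(suggestion))
```

The result is only a suggestion; the user should always be able to accept, change or discard it.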

I will continue in the next section with a more thorough discussion of this implementation.

Requirements

  • Keywords
  • Article categories
  • Journal category/classification


Keywords

Alias

alias: these are synonyms, i.e., the 2 words are equivalent
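
A minimal sketch of how aliases could be resolved before storing or searching a keyword. The `ALIASES` table is an assumption built from examples that appear elsewhere on this page (broken/fractured limb, the British/American spellings); which term counts as the preferred one is an arbitrary choice here.

```python
# alias -> preferred term (illustrative entries only)
ALIASES = {
    "broken limb": "fractured limb",
    "haematology": "hematology",
    "paediatrics": "pediatrics",
}


def canonical(keyword):
    """Normalize case and spacing, then map an alias to its preferred term."""
    key = " ".join(keyword.lower().split())
    return ALIASES.get(key, key)


def same_keyword(a, b):
    """Two keywords are equivalent if they share the same preferred term."""
    return canonical(a) == canonical(b)
```

Storing only the preferred term avoids the situation described above, where records diverge because each person uses a different synonym.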

Hierarchical Keyword Tree

Hierarchical tree structure:

  • the presence of one term implies automatically another term, although the 2 are not aliases/synonyms, e.g.
    • endocarditis implies infection, bacteremia, heart valves and medicine, too;
    • another non-medical example: whale implies mammal, ocean and water
  • dynamic trees
    • these trees must NOT be rigid
    • rather, they should be dynamic: a user may want to change the relationships later to optimise some search results and change it again for still another search
  • intersecting trees (complex relationships)
    • one keyword may belong to more than one tree:
      • endocarditis -> heart valves -> cardiology; and endocarditis -> bacteremia -> infection
      • a non-medical example: whale -> mammal -> animal; and whale -> ocean -> hydrosphere
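
The "implies" relationships above can be stored as a simple mapping and expanded at search time. The sketch below assumes a plain dictionary (`IMPLIES`) holding the two examples from this section; the function names are hypothetical. Because the mapping is passed as a parameter, a user can swap in a different set of relationships for a different search, which is the "dynamic" property asked for above.

```python
from collections import deque

# term -> terms it implies directly (the two examples from this section)
IMPLIES = {
    "endocarditis": ["heart valves", "bacteremia"],
    "heart valves": ["cardiology"],
    "bacteremia": ["infection"],
    "whale": ["mammal", "ocean"],
    "mammal": ["animal"],
    "ocean": ["hydrosphere"],
}


def expand(keyword, implies=IMPLIES):
    """Return the keyword plus every term it implies, directly or indirectly."""
    seen = {keyword}
    queue = deque([keyword])
    while queue:
        term = queue.popleft()
        for implied in implies.get(term, []):
            if implied not in seen:  # also protects against cyclic relations
                seen.add(implied)
                queue.append(implied)
    return seen


def matches(query, article_keywords, implies=IMPLIES):
    """True if the query equals, or is implied by, one of the stored keywords."""
    return any(query in expand(k, implies) for k in article_keywords)
```

An article tagged only with "endocarditis" is then found by a search for "infection", without storing the long keyword list explicitly.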

The users should be able to:

  • write their own trees / tree relationships
  • store these trees for future use

Because this concept is so important, I will expand the endocarditis example:

cardiology <- heart valves <- endocarditis <- diagnosis, treatment, epidemiology (all 3 belong to this node)
infection <-|
            |- endocarditis <- Staphylococcus aureus, Streptococcus, fastidious organisms
            |- bacteremia <- endocarditis <- (various bacteria, see previous tree)

As can be seen, endocarditis might belong to 3 different trees, and I may use any one (or 2, or all 3) of them, depending on what I wish to search.
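
One possible way to keep the three trees separate, so that a search can use any one, two or all three of them, is sketched below. The tree names (`"valves"`, `"direct infection"`, `"bacteremia"`) and the child-to-parent layout are assumptions made for this example; JSON is used only to show that user-defined trees can be stored and reloaded.

```python
import json

# Each tree maps a term to its parent, following the diagram above.
TREES = {
    "valves": {"endocarditis": "heart valves", "heart valves": "cardiology"},
    "direct infection": {"endocarditis": "infection"},
    "bacteremia": {"endocarditis": "bacteremia", "bacteremia": "infection"},
}


def ancestors(term, tree_names, trees=TREES):
    """Collect the parents of a term, using only the selected trees."""
    found = set()
    for name in tree_names:
        parent_of = trees[name]
        current = term
        visited = {term}  # guard against accidental cycles
        while current in parent_of:
            current = parent_of[current]
            if current in visited:
                break
            visited.add(current)
            found.add(current)
    return found


# Trees can be stored for future use and reloaded unchanged.
stored = json.dumps(TREES, indent=2)
reloaded = json.loads(stored)
```

For a cardiology-oriented search one would call `ancestors("endocarditis", ["valves"])`; for an infection-oriented search, `ancestors("endocarditis", ["direct infection", "bacteremia"])`.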

Complex relationships

  • endocarditis occurs most often on the heart valves, hence the heart valves <- endocarditis relationship (but it does not occur exclusively on heart valves)
  • endocarditis usually produces bacteremia (= bacteria in the blood), but blood cultures may remain negative in ~10% of cases (due to antibiotic pretreatment or fastidious organisms)
  • the heart valves might get infected during a bacteremic episode itself, therefore the relationship endocarditis <-> bacteremia is more complex
  • depending on what I wish to search, I may create very specific keyword trees

Article Categories

The article category should contain both the field of work (e.g. medicine) and the type of article (e.g. review). Therefore we should have:

  • category: see Journal Classification below
  • article type: see at the top of this page


Journal Classification

This describes what is needed to implement a standardized journal classification.

We need to define/create lists with:

  • basic categories: this needs to be defined at the top of the hierarchy; every article belongs to one (or more) of these basic categories
  • list of journals: needed for the next point;
  • basic category for journals: we will need to apply one or more categories to every journal.


Basic Field / Top Categories

Question: Do we need subcategories OR, more specifically, how do we define subcategories?

Some journals sort the articles based on some standardised subcategories (this would be usually the 3rd-4th item in the tree/ hierarchy):


These lists are incomplete. Please fill in whenever you find additional information.

Various editors sort their publications based on comprehensive speciality lists, e.g.:

Categories: Top Node

  • Humanities
  • Law
  • Life Sciences: Should we have one category Biomedical sciences?
  • Mathematics and Physical Sciences
  • Medicine: see Life Sciences
  • Social Sciences

[this list was taken from Oxford Journals]

  • Mathematics and Physical Sciences
    • mathematics
    • physics
      • quantum mechanics (these would be subcategories, ... or still main categories)
      • astrophysics
      • others
  • Life Sciences / Biomedical Sciences: part of Biomedical sciences?
    • biology
  • Biomedical Sciences / Medicine
    • non-surgical / internal medicine
      • cardiology
      • endocrinology
        • diabetology
      • gastroenterology
        • hepatology
      • haematology / hematology
      • infectious diseases: should be separate entity? [one node higher]
      • pulmonology / respiratory medicine
      • nephrology
      • neurology
      • geriatric medicine: one node higher?
      • immunology / rheumatology: should be separate?
      • many subspecialities
    • dermatology
    • intensive care / critical care
    • cognitive sciences/ psychiatry
    • paediatrics/ pediatrics
    • radiology
    • surgery
      • abdominal surgery
      • cardio-vascular surgery/ cardiothoracic surgery
      • emergency medicine
      • obstetrics and gynecology
      • neurosurgery
      • ophthalmology
      • orthopedics
      • otolaryngology/ ENT surgery
      • plastic surgery
      • urology
      • many subspecialities
    • dentistry
    • nursing

Should these be higher categories?

    • infectious diseases
      • microbiology (could be one hierarchical node higher)
      • virology
      • parasitology
      • tropical medicine
      • epidemiology
    • microbiology (could be subspeciality of infectious diseases)

Feel free to expand this list!!!


Journals

This list will include the full name of the journal, the abbreviated name and the journal category.

Please note, that this list is important NOT only for this feature:

  • some journals require the FULL journal name in the bibliography (e.g. JAC requires Journal of Antimicrobial Chemotherapy and not J Antimicrob Chemother)
  • others require the abbreviated name (actually most journals fit here)
  • some journals have very short aliases (like JAC, CID, NEJM), which I would like to use when entering by hand a bibliographic entry, BUT this is not the official abbreviation and should therefore automatically be converted to the official abbreviation
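
A minimal sketch of the conversion described in the last point: whatever form the user types (full name, official abbreviation or custom shortcut), the entry is resolved to one record and then written in the form the target journal requires. The three sample entries are taken from the journal tables on this page; the function names and the `style` parameter are assumptions for this example.

```python
# (full journal name, official abbreviation, custom shortcut)
JOURNALS = [
    ("Journal of Antimicrobial Chemotherapy", "J Antimicrob Chemother", "JAC"),
    ("Clinical Infectious Diseases", "Clin Infect Dis", "CID"),
    ("New England Journal of Medicine", "New Engl J Med", "NEJM"),
]


def build_index(journals):
    """Map every known form of a journal name (lower-cased) to its record."""
    index = {}
    for full, abbrev, shortcut in journals:
        for form in (full, abbrev, shortcut):
            if form:
                index[form.lower()] = (full, abbrev)
    return index


INDEX = build_index(JOURNALS)


def format_journal(name, style="abbrev"):
    """Return the full name or the official abbreviation for any known form.

    Unknown names are returned unchanged, so nothing the user typed is lost.
    """
    record = INDEX.get(" ".join(name.lower().split()))
    if record is None:
        return name
    full, abbrev = record
    return full if style == "full" else abbrev
```

Typing `JAC` by hand would then yield "J Antimicrob Chemother" in most bibliographies and "Journal of Antimicrobial Chemotherapy" where the full name is required.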


I have imported 5269 journals from Pubmed (see gawk-script below)

  • Journal List Last Updated: September 20, 2006
  • the gawk-script makes it easy to update the list
  • this list contains neither the URL nor the Journal Category, but I will work to automate that, too
  • I believe the list is too large to post here (but it can be recreated easily with the gawk-script, and I can compress it and post it somewhere as an attachment)


Sites With Journal Lists

There are various sites having extensive journal lists:

Journal List

I have this list as an OOo Writer document, too (it contains tables). I will expand it whenever I have time. One useful addition to this list would be the journal's URL.

Full Journal Name | Short Journal Name (Abbreviation) | Custom Shortcut | Journal Category | URL


Infectious Diseases Journals

Full Journal Name | Short Journal Name (Abbreviation) | Custom Shortcut | Journal Category | URL
American Journal of Infection Control | Am J Infect Control | AJIC | med, infx | http://journals.elsevierhealth.com/periodicals/ymic/issues
Antimicrobial Agents and Chemotherapy | Antimicrob Agents Chemother | AAC | med, infx, abx |
Chemotherapy | Chemotherapy | | med |
Clinical Infectious Diseases | Clin Infect Dis | CID | med, infx |
Clinical Microbiology Reviews | Clin Microbiol Rev | CMR | med, infx |
Emerging Infectious Diseases | Emerg Infect Dis | | med, infx |
European Journal of Clinical Microbiology | Eur J Clin Microbiol | | med, infx |
European Journal of Clinical Microbiology and Infectious Diseases | Eur J Clin Microbiol Infect Dis | | med, infx |
Infection | Infection | | med, infx |
Infection Control Hospital Epidemiology | Infect Control Hospital Epidemiol | | med, infx |
Infectious Disease Clinics of North America | Infect Dis Clin N Am | | med, infx |
International Journal of Antimicrobial Agents | Int J Antimicrob Agents | | med, infx, abx |
Journal of Antimicrobial Chemotherapy | J Antimicrob Chemother | JAC | med, infx, abx |
Journal of Bacteriology | J Bacteriol | | med, infx |
Journal of Clinical Microbiology | J Clin Microbiol | JCM | med, infx |
Journal of Hospital Infection | J Hosp Infect | | med, infx |
Journal of Infectious Diseases | J Infect Dis | JID | med, infx |
Journal of Medical Microbiology | J Med Microbiol | JMM | med, infx, microbiol |
Microbes and Infection | Microbes Infect | | med, infx |
Microbiological Reviews | Microbiol Rev | | med, infx, microbiol |
Research in Microbiology | Res Microbiol | | med, infx, microbiol | http://www.sciencedirect.com/science/journal/09232508
Review Infectious Diseases | Rev Infect Dis | | med, infx |
Scandinavian Journal Infectious Diseases | Scand J Infect Dis | | med, infx |
Veterinary Microbiology | Vet Microbiol | | biomed, vet, microbiol |
International Journal of Systematic and Evolutionary Microbiology | Int J Syst Evol Microbiol | IJSEM | biomed, med, infx, microbiol | http://ijs.sgmjournals.org

General Medical Journals

Full Journal Name | Short Journal Name (Abbreviation) | Custom Shortcut | Journal Category | URL
American Journal of Medicine | Am J Med | | med, all | http://www.sciencedirect.com/science/journal/00029343
Annals of Internal Medicine | Ann Intern Med | | med, intern | http://www.annals.org
British Medical Journal | BMJ | BMJ | med, all | http://bmj.bmjjournals.com
Journal of the American Medical Association | JAMA | JAMA | med, all | http://jama.ama-assn.org
Lancet | Lancet | | med, all | http://www.thelancet.com
New England Journal of Medicine | New Engl J Med | NEJM | med, all | http://www.nejm.org

Statistics

Journal of Statistical Software, http://www.stat.ucla.edu/journals/jss/


All Categories

Full Journal Name | Short Journal Name (Abbreviation) | Custom Shortcut | Journal Category | URL
Nature | Nature | | all |
Science | Science | | all |

Cell/ Molecular Biology

Full Journal Name | Short Journal Name (Abbreviation) | Custom Shortcut | Journal Category | URL
Journal of Biological Chemistry | J Biol Chem | JBC | biomed, cell biol, chem |
Proceedings of the National Academy of Sciences of the USA | Proc Natl Acad Sci USA | PNAS | biomed, cell biol, all |

Oxford Journals

Incomplete - still need to do a lot of work!!! When I finish, I will move these entries into their respective categories.

Age and Ageing | Age Ageing | | med, geront | http://ageing.oxfordjournals.org/
Alcohol and Alcoholism | Alcohol Alcohol | | med, behav | http://alcalc.oxfordjournals.org/
American Journal of Epidemiology | Am J Epidemiol | | med, epidem | http://aje.oxfordjournals.org/
Annals of Occupational Hygiene | Ann Occup Hyg | | med, epidem, hygiene | http://annhyg.oxfordjournals.org/
Annals of Oncology | Ann Oncol | | med, oncol | http://annonc.oxfordjournals.org/
BJA: British Journal of Anaesthesia | Br J Anaesth | BJA | med, ICU | http://bja.oxfordjournals.org/
Brain | Brain | | med, neuro | http://brain.oxfordjournals.org/
Brief Treatment and Crisis Intervention | Brief Treat Crisis Interven | | med, behav | http://brief-treatment.oxfordjournals.org/
British Medical Bulletin | Br Med Bull | | med, all | http://bmb.oxfordjournals.org/
Continuing Education in Anaesthesia, Critical Care & Pain | Contin Educ Anaesth Crit Care Pain | | med, ICU | http://ceaccp.oxfordjournals.org/
Europace | Europace | | med, cardio | http://europace.oxfordjournals.org/
European Heart Journal | Eur Heart J | | med, cardio | http://eurheartj.oxfordjournals.org/
The European Journal of Orthodontics | Eur J Orthod | | med, dentist | http://ejo.oxfordjournals.org/
The European Journal of Public Health | Eur J Public Health | | med, epidem | http://eurpub.oxfordjournals.org/
Evidence-based Complementary and Alternative Medicine | Evid Based Complement Alternat Med | eCAM | med, alt | http://ecam.oxfordjournals.org/
Family Practice | Fam Pract | | med | http://fampra.oxfordjournals.org
Health Education Research | Health Educ Res | | med, epidem | http://her.oxfordjournals.org
Health Policy and Planning | Health Policy Plan | | med, epidem | http://heapol.oxfordjournals.org
Health Promotion International | Health Promot Int | | med, epidem | http://heapro.oxfordjournals.org
Human Reproduction | Hum Reprod | | med, gyn | http://humrep.oxfordjournals.org
Human Reproduction Update | Hum Reprod Update | | med, gyn | http://humupd.oxfordjournals.org


GAWK HELPER SCRIPT

This page will contain some useful gawk scripts for formatting the different journal lists.


Requirements:

  • awk/gawk:
    • if you are on a UNIX machine, almost surely you will have it installed on your computer
    • if you're on a WINDOWS machine, almost surely you won't have it; you can get gawk for free from http://www.sourceforge.net project gnuwin32


Format PUBMED Journal List

The latest PUBMED Journal List can be downloaded from: http://www.ncbi.nlm.nih.gov/entrez/linkout/journals/jourlists.cgi?typeid=1&type=journals&format=text&operation=Show


Use:

  • Save the above list as a plain text file (Limitation: it contains neither the very short Abbreviation, nor the URL, nor the Journal Category); you also need to manually delete the first line from that text file (it is not a journal entry!!!)
  • Save the following script as a file, e.g. this-script-file.awk
  • run the gawk script, e.g. gawk -f "this-script-file.awk" your-plain-text-file.txt
  • the script will create a new text file, Journals-Pubmed-Extracted.txt, that will contain the list with
    • Full Journal Names
    • Abbreviations and
    • ISSN (the journal entries are UNIQUE)
  • I will work to automate the URL import, too
    • Journal Category will remain a manual task
# This program EXTRACTS JOURNAL NAMES from the PUBMED JOURNAL TEXT LIST, v1.01
# The latest list can be downloaded from:
# http://www.ncbi.nlm.nih.gov/entrez/linkout/journals/jourlists.cgi?typeid=1&type=journals&format=text&operation=Show

# I have imported 5269 journals from Pubmed,
# Journal List Last updated: September 20, 2006


BEGIN {
	val = ""
	cel[1] = "" # ARRAY TAKING THE VALUES
		# 1: JOURNAL FULL NAME
		# 2: JOURNAL ABBREVIATION
		# 3: ISSN
} # END BEGIN



# START ACTUAL PROGRAM <------------------------------------>

# DELETE SPACES

/  /	{ gsub(/  +/, " ")  } # DELETE MULTIPLE SPACES
/^ /	{ gsub(/^ / , "" )  } # REMOVE LEADING SPACE (RUNS ARE ALREADY COLLAPSED ABOVE)
/ $/	{ gsub(/ $/ , "" )  } # REMOVE TRAILING SPACES

/ [:]/	{ gsub(/ [:]/ , ":" ) } # REMOVE SPACE BEFORE ':'

{if(length($0) == 0) {next} } # SKIP EMPTY LINES


{
	split($0,cel,"|")
	
	# DELETE ENDING "." FROM JOURNAL NAME
	i = match(cel[1],/[.]$/)
	if(i > 0) {cel[1] = substr(cel[1],1,i-1) }

	s = cel[1] "\t" cel[2] "\t" cel[3]
	
	if(s == val) {next} # SKIP DUPLICATE ENTRY
	val = s # STORE PREVIOUS VALUE TO FIND DUPLICATES

	print s >> "Journals-Pubmed-Extracted.txt"
}

Format Data for this wiki

This script will format the table text for use on this wiki page.


Use:

  • Save your Journal Table as a plain text file, with the cells separated by tab and the rows as separate lines
  • Save the following script as a file, e.g. this-script-file.awk
  • run the gawk script, e.g. gawk -f "this-script-file.awk" your-plain-text-file.txt
  • the script will create a new text file, Journals-OOo.txt, that will contain the formatted text, suitable to paste into this wiki page


GAWK SCRIPT

BEGIN {
	intro = "{| cellspacing=\"0\" cellpadding=\"5\" border=\"1\"" # WIKI TABLE HEADER
	print intro >> "Journals-OOo.txt"
} # END BEGIN


# START ACTUAL PROGRAM <------------------------------------>

# DELETE SPACES

/  /	{ gsub(/  +/, " ")  } # DELETE MULTIPLE SPACES
/^ /	{ gsub(/^ / , "" )  } # REMOVE LEADING SPACE (RUNS ARE ALREADY COLLAPSED ABOVE)
/ $/	{ gsub(/ $/ , "" )  } # REMOVE TRAILING SPACES



{if(length($0) == 0) {next} } # SKIP EMPTY LINES


/\t/ {gsub(/\t/,"\n| ")}

{
	print "|-\n| " $0 >> "Journals-OOo.txt"
}

END {
	print "|}" >> "Journals-OOo.txt"
}