Sunday, November 22, 2009
Comments Week 10
Thursday, November 19, 2009
Assignment 6: Website
Week 10: Web Search and OAI
Unit 10: Web Search and OAI Protocol
David Hawking, Web Search Engines: Part 1 and Part 2, IEEE Computer, June 2006.
Part 1:
-search engines cannot and should not index the whole internet (some content is protected, and a web page can be created in seconds)
-Figure one was very helpful - it shows how a generic search engine processes and retrieves information.
-A data center can be dedicated to a single function or can carry out several of them:
-crawling
-indexing
-query process
-in order to process search engine queries, large-scale replication is required
-and it takes a lot of time (10+ in 2006 to do one index of the web)
-"good seed" URLs link to a lot of good websites and are very helpful for crawlers
-crawlers operate in parallel in order to increase accuracy and speed
-crawlers save considerable time by recognizing and eliminating duplicated information on different URLs
-spammers can trick a crawler by creating fake URLs that the page 'links' to, and by cloaking: delivering different content to crawlers than to site visitors.
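To make the crawling notes above more concrete, here is a minimal sketch of a crawler that starts from a seed URL, never queues the same URL twice, and skips pages whose content it has already stored under another URL. It is only an illustration, not how the article's search engines are built: the toy web, the URLs, and the function name crawl are all made up, the pages live in a dictionary instead of being fetched over HTTP, and it runs in a single thread rather than in parallel.

```python
import hashlib
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch these pages over HTTP instead.
TOY_WEB = {
    "http://seed.example/": ("directory of good sites",
                             ["http://a.example/", "http://b.example/"]),
    "http://a.example/": ("article about search engines",
                          ["http://b.example/", "http://a.example/mirror"]),
    # Same content as http://a.example/ under a different URL.
    "http://a.example/mirror": ("article about search engines", []),
    "http://b.example/": ("article about crawlers", ["http://seed.example/"]),
}


def crawl(seeds, web):
    """Breadth-first crawl that skips seen URLs and duplicate content."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen_urls = set(seeds)    # URLs already queued, so none is fetched twice
    seen_hashes = set()       # fingerprints of content already stored
    stored = []               # URLs whose content was kept for indexing

    while frontier:
        url = frontier.popleft()
        if url not in web:    # unreachable page in this toy model
            continue
        text, links = web[url]
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if fingerprint in seen_hashes:
            continue          # duplicate content: do not store it again
        seen_hashes.add(fingerprint)
        stored.append(url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return stored


pages = crawl(["http://seed.example/"], TOY_WEB)
print(pages)
```

The mirror page is visited but not stored, because its fingerprint matches a page that was already kept.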
Part 2:
-This is a little bit more confusing
-Search engines use an inverted file to rapidly index and identify terms
-so basically it counts the frequency of a search term within a document and then organizes them by their number order?
-search engines always have to be cautious of the creation of new terms/slang/acronyms/trademarks (the R2-D2 example)
-indexers compress data to save space and memory
-anchor text helps the search engine know the quality of the page
-Link popularity score: Google
-One major problem: poor results (a search for The Onion, the newspaper site, brings back stuff about the vegetable)
-search engines can use ranking processes, but these may take a long time
-index entries are stored in order of decreasing value so that the query processor doesn't have to search the whole list
-caching can help speed things up
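Since the inverted file was the confusing part, here is a small sketch of the idea as I understand it: the index is organized by term, and each term points to a list of the documents that contain it. In this sketch each list also records how often the term occurs and is sorted in decreasing order, so a query can stop after the first few entries. The three documents, the function names, and the use of raw term frequency as the only score are assumptions for illustration; real engines combine many more signals, such as anchor text and link popularity.

```python
from collections import Counter, defaultdict

# Hypothetical three-document collection used only for illustration.
DOCS = {
    1: "the onion is a satirical newspaper",
    2: "an onion is a vegetable and the onion has layers",
    3: "search engines index the web",
}


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, term_frequency) pairs,
    sorted by decreasing frequency (ties broken by doc_id)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((doc_id, freq))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[1], p[0]))
    return dict(index)


def search(index, term, top_k=1):
    """Return the first top_k entries for a term. Because each list is
    already sorted, the rest of the list never has to be read."""
    return index.get(term.lower(), [])[:top_k]


index = build_inverted_index(DOCS)
print(index["onion"])          # every document containing "onion"
print(search(index, "onion"))  # only the highest-scoring entry
```

Looking up "onion" returns document 2 first because the word appears there twice, which also shows the Onion problem from the notes: counting words alone cannot tell the newspaper from the vegetable.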
Shreeves, S. L., Habing, T. O., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI protocol for metadata harvesting. Library Trends, 53(4), 576-589.
-OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting was first released in 2001
-Mission: "develop and promote interoperability standards that aim to facilitate the efficient dissemination of content"
-Encourages the use of Dublin Core metadata schema
-developed with an Andrew W. Mellon Foundation grant
-the initiative recognized that in order to make several specific information providers available to a broader audience, a community-based protocol needed to be established
-examples: the Sheet Music Consortium, National Science Digital Library
-the OAI-PMH worked to complete the listings of repositories and their individual collections
-also, one of the main objectives was to make it searchable and browsable
-machine processing was also a concern
-Some issues:
-different types of metadata styles (migrating the formats). This also creates a problem for searching.
-communication problems: very loosely federated and connected, need to create a community for users and providers
-controlling vocabularies as the initiative grows
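To get a feel for what harvesting looks like in practice, here is a short sketch that builds two OAI-PMH request URLs. The protocol works over plain HTTP: a harvester sends a verb (such as Identify or ListRecords) plus arguments to a repository's base URL and gets XML back. The base URL below is hypothetical and the helper name oai_request is my own; the sketch only builds the URLs and does not contact any server.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint; each data provider publishes its own base URL.
BASE_URL = "http://repository.example.org/oai"


def oai_request(verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return BASE_URL + "?" + urlencode(params)


# Ask the repository to describe itself.
identify_url = oai_request("Identify")

# Harvest records as unqualified Dublin Core (the oai_dc metadata prefix).
list_url = oai_request("ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_url)
```

The metadataPrefix argument is where the metadata-format issue from the notes shows up: oai_dc (Dublin Core) is the common baseline, and anything richer depends on what each repository chooses to offer.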
**This was a very interesting article. It recognizes the need for an electronic and collaborative collection of metadata in order to make archival collections accessible and searchable. Yet it clearly recognizes that this is extremely hard to accomplish. Different groups and archives use different types of metadata schemas and different processing procedures. I briefly visited the website and it was very detailed and specific, which can be good or bad. Should certain archives change the way they have been doing things for decades to fit within the specifications of this protocol?
Michael K. Bergman, "The Deep Web: Surfacing Hidden Value"
-The web is very complex; the information we don't readily think about or access is on the 'deep web'
-I'm sure there have been some developments in this area seeing as the paper was written eight years ago
-the 'deep' web is abstract yet measurable?
-Have to use a direct query search to access info on the deep web
-typical search engines only scour the surface web - is that true even today?
-A deep web page may have many records within it
**University libraries with online databases are one example of a deep web page
-In 2001 this study documented 60 deep web sites that together exceed the size of the surface web by 40 times
-Searches into the deep web may take longer but are probably of higher quality because they are a direct query
How relevant is this study today since it was conducted almost a decade ago? Have search engines adapted in order to search the deep web? Why would you want to search the deep web?
