Shannon's LIS 2600 Information Technology Blog

http://web.me.com/shannonregan/LIS_2600/Home.html

I originally posted my website to my public Mac MobileMe account to preview it. Before the new tutorial came out, I tried to upload it to my Pitt AFS space through the FTP server and ran into a few problems. I thought the problem might be with my file, so I changed some of its settings. Unfortunately, that was not the issue, and my website file ended up corrupted beyond retrieval; I would have had to start all over to post it to my Pitt AFS space. I spoke with Dr. He on 11/17/09, and he said to put this information in my blog along with the link to my website on my MobileMe space. The website on my MobileMe space is the exact website I would have uploaded to my Pitt AFS space.

Unit 10: Web Search and OAI Protocol

David Hawking, Web Search Engines: Part 1 and Part 2, IEEE Computer, June 2006.

Part 1:

-search engines cannot and should not index the whole internet (some content is protected, and a web page can be created in seconds)

-Figure one was very helpful - it shows how a generic search engine processes and retrieves information.

-Different data centers can have different functions on their own, or operate several different functions

-crawling

-indexing

-query process

-in order to process search engine queries, large-scale replication is required

-and it takes a lot of time (10+ in 2006 to do one index of the web)

-"good seed" URLs link to a lot of good websites and are very helpful for crawlers

-crawlers run in parallel in order to increase accuracy and speed

-crawlers save considerable time by recognizing and eliminating duplicated information on different URLs

-spammers can trick a crawler by creating fake URLs that the page 'links' to, and by cloaking: the practice of delivering different content to crawlers than to site visitors.
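
To make the crawling notes above a little more concrete, here is a minimal sketch of a crawler that starts from a seed URL, follows links, and skips duplicated content found under different URLs. It is not the algorithm from Hawking's article: the toy "web" dictionary, the example URLs, and the function name are all made up for illustration, and it runs in a single thread rather than in parallel.

```python
import hashlib
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://seed.example/": ("directory of good sites",
                             ["http://a.example/", "http://b.example/"]),
    "http://a.example/": ("article about search engines",
                          ["http://b.example/", "http://a.example/copy"]),
    # Same content as http://a.example/ but under a different URL.
    "http://a.example/copy": ("article about search engines", []),
    "http://b.example/": ("article about metadata", ["http://seed.example/"]),
}


def crawl(seeds, web):
    """Breadth-first crawl from the seed URLs.

    Skips URLs that were already queued and pages whose content
    was already seen under another URL.
    """
    frontier = deque(seeds)      # URLs waiting to be fetched
    seen_urls = set(seeds)       # URLs already queued
    seen_hashes = set()          # fingerprints of content already indexed
    indexed = []

    while frontier:
        url = frontier.popleft()
        page = web.get(url)
        if page is None:
            continue             # dead link in the toy web
        text, links = page
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue             # duplicate content, do not index again
        seen_hashes.add(digest)
        indexed.append(url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return indexed


print(crawl(["http://seed.example/"], TOY_WEB))
# ['http://seed.example/', 'http://a.example/', 'http://b.example/']
```

A good seed matters here because everything the crawler finds is reached through the links on the seed page. A real crawler would also need politeness delays, many parallel workers, and defenses against the spam tricks mentioned above.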

Part 2:

-This is a little bit more confusing

-Search engines use an inverted file to rapidly index and identify terms

-so basically it counts the frequency of a search term within a document and then organizes them by their number order?

-search engines always have to be cautious of the creation of new terms/slang/acronyms/trademarks (the R2-D2 example)

-indexers compress data to save space and memory

-anchor text helps the search engine know the quality of the page

-Link popularity score: Google

-One major problem: poor results (a search for The Onion, the newspaper site, brings back stuff about the vegetable)

-search engines can use ranking processes, but they may take a long time

-index entries are ordered by decreasing value so that the query processor doesn't have to search the whole list

-caching can help speed things up
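
Since the inverted file was the confusing part for me, here is a small sketch of the idea as I understand it: for each term, keep a list of the documents that contain it, along with how often it appears. The three sample documents and the function names are my own inventions, and ordering the list by raw term frequency is a simplification I am assuming for illustration; real engines combine many more signals.

```python
from collections import Counter, defaultdict

# Hypothetical sample documents: document id -> text.
DOCS = {
    1: "the onion is a satirical newspaper",
    2: "an onion is a vegetable and the onion has layers",
    3: "search engines index the web",
}


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((doc_id, freq))
    # Order each postings list by decreasing frequency (ties by doc id),
    # so a query can stop after the first few entries.
    for postings in index.values():
        postings.sort(key=lambda posting: (-posting[1], posting[0]))
    return dict(index)


def search(index, term, limit=10):
    """Return up to `limit` postings for a single term."""
    return index.get(term.lower(), [])[:limit]


INDEX = build_inverted_index(DOCS)
print(search(INDEX, "onion"))   # [(2, 2), (1, 1)]
```

Notice that the vegetable document comes back ahead of the newspaper one for "onion", which is the same kind of poor result noted above when frequency alone decides the order.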

Shreeves, S. L., Habing, T. O., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI protocol for metadata harvesting. Library Trends, 53(4), 576-589.

-OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting was first released in 2001

-Mission: "develop and promote interoperability standards that aim to facilitate the efficient dissemination of content"

-Encourages the use of Dublin Core metadata schema

-developed with an Andrew W. Mellon Foundation grant

-the initiative recognized that in order to make several specific information providers available to a broader audience, a community-based protocol needed to be established

-examples: the Sheet Music Consortium, National Science Digital Library

-the OAI-PMH worked to complete the listings of repositories and their individual collections

-also, one of the main objectives was to make it searchable and browsable

-machine processing was also a concern

-Some issues:

-different types of metadata styles (migrating the formats). This also creates a problem for searching.

-communication problems: very loosely federated and connected, need to create a community for users and providers

-controlling vocabularies as the initiative grows
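
To get a feel for what harvesting looks like in practice, here is a rough sketch of an OAI-PMH request and of pulling Dublin Core titles out of a response. The `ListRecords` verb, the `oai_dc` metadata prefix, and the XML namespaces come from the OAI-PMH specification; the repository address, the sample record, and the function names are hypothetical, and nothing is actually sent over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical repository address, used only to show the URL shape.
BASE_URL = "https://repository.example.org/oai"

# Namespaces used by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up response containing a single record.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example Sheet Music</dc:title>
          <dc:creator>Unknown</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build the URL a harvester would request to list records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def harvest_titles(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results


print(list_records_url(BASE_URL))
print(harvest_titles(SAMPLE_RESPONSE))
```

Because every provider answers the same request with the same Dublin Core elements, a service provider can merge records from many repositories, which is where the differing metadata styles noted above start to cause trouble.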

**This was a very interesting article. It recognizes the need for an electronic and collaborative collection of metadata in order to make archival collections accessible and searchable. Yet it clearly recognizes that it is extremely hard to accomplish. Different groups and archives use different types of metadata schemas and different processing procedures. I briefly visited the website and it was very detailed and specific which can be good or bad. Should certain archives change the way they have been doing things for decades to fit within the specifications of this protocol??

Michael K. Bergman, "The Deep Web: Surfacing Hidden Value"

-The web is very complex; the information we don't readily think about or access is on the 'deep web'

-I'm sure there have been some developments in this area seeing as the paper was written eight years ago

-the 'deep' web is abstract yet measurable?

-Have to use a direct query search to access info on the deep web

-typical search engines only scour the surface web - is that true even today?

-A deep web page may have many records within it

**University libraries with online databases are one example of a deep web page

-In 2001 this study documented 60 deep web sites that together exceed the size of the surface web by 40 times

-Searches into the deep web may take longer but are probably of higher quality because they are a direct query
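
Here is a toy sketch of why a link-following crawler misses deep web content while a direct query finds it. The site layout, the catalog records, and the function names are all invented for this example; it only illustrates the distinction and is not how Bergman measured anything.

```python
# Hypothetical site: static pages and the links between them.
SURFACE_PAGES = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],   # a query form; no links lead to individual records
}

# Hypothetical database behind the /search form (the "deep" content).
CATALOG = [
    {"id": 1, "title": "Web Search Engines"},
    {"id": 2, "title": "The Deep Web: Surfacing Hidden Value"},
    {"id": 3, "title": "OAI Protocol for Metadata Harvesting"},
]


def crawl_links(start, pages):
    """Return every page reachable by following links from `start`."""
    seen, stack = set(), [start]
    while stack:
        url = stack.pop()
        if url in seen or url not in pages:
            continue
        seen.add(url)
        stack.extend(pages[url])
    return seen


def query_catalog(keyword, catalog):
    """Direct query: return titles of records that match the keyword."""
    keyword = keyword.lower()
    return [rec["title"] for rec in catalog if keyword in rec["title"].lower()]


print(sorted(crawl_links("/", SURFACE_PAGES)))   # ['/', '/about', '/search']
print(query_catalog("web", CATALOG))
# ['Web Search Engines', 'The Deep Web: Surfacing Hidden Value']
```

The crawler reaches the search form but none of the three catalog records, since no page links to them; only the direct query surfaces them. A library database works the same way.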

How relevant is this study today since it was conducted almost a decade ago? Have search engines adapted in order to search the deep web? Why would you want to search the deep web?

Sunday, November 22, 2009

Comments Week 10


Thursday, November 19, 2009

Assignment 6: Website

Week 10: Web Search and OAI

Wednesday, November 18, 2009

Muddiest Point Week 10

Thursday, November 12, 2009

Muddiest Point Week 9

Monday, November 2, 2009

Comments Week 9
