The deep web

 

Table Of Contents


Chapter ONE

INTRODUCTION

  • 1.1 Introduction
  • 1.2 Background of Study
  • 1.3 Problem Statement
  • 1.4 Objective of Study
  • 1.5 Limitation of Study
  • 1.6 Scope of Study
  • 1.7 Significance of Study
  • 1.8 Structure of the Research
  • 1.9 Definition of Terms

Chapter TWO

LITERATURE REVIEW

  • 2.1 Overview of Literature Review
  • 2.2 Theoretical Framework
  • 2.3 Conceptual Framework
  • 2.4 Historical Perspective
  • 2.5 Empirical Studies
  • 2.6 Current Trends
  • 2.7 Critical Analysis
  • 2.8 Research Gaps
  • 2.9 Methodological Approaches
  • 2.10 Summary of Literature Review

Chapter THREE

SYSTEM DESIGN AND IMPLEMENTATION

  • 3.1 Research Methodology Overview
  • 3.2 Research Design
  • 3.3 Data Collection Methods
  • 3.4 Sampling Techniques
  • 3.5 Data Analysis Procedures
  • 3.6 Research Ethics
  • 3.7 Reliability and Validity
  • 3.8 Limitations of Methodology

Chapter FOUR

SYSTEM TESTING AND EVALUATION

  • 4.1 Data Analysis and Interpretation
  • 4.2 Descriptive Statistics
  • 4.3 Inferential Statistics
  • 4.4 Comparison of Findings
  • 4.5 Themes and Patterns
  • 4.6 Case Studies
  • 4.7 Discussion of Results
  • 4.8 Implications of Findings

Chapter FIVE

SUMMARY, CONCLUSION AND RECOMMENDATIONS

  • 5.1 Conclusion and Summary
  • 5.2 Recap of Research Objectives
  • 5.3 Key Findings
  • 5.4 Recommendations for Future Research
  • 5.5 Practical Implications
  • 5.6 Contribution to Knowledge
  • 5.7 Conclusion Statement
  • 5.8Final Thoughts

Project Abstract

The deep web is the part of the internet that is not indexed by traditional search engines such as Google or Bing. It is estimated to be significantly larger than the surface web, the portion of the internet that search engines do index. The deep web includes a wide range of content that cannot be found through standard search engines, such as private databases, password-protected websites, and other resources not intended for public consumption. Some parts of the deep web are also designed for anonymity: users can access this content without revealing their identities, which makes it attractive to individuals seeking privacy and security. This anonymity has also made those parts of the deep web a haven for illegal activities such as drug trafficking, weapons sales, and other illicit transactions. Law enforcement agencies around the world have struggled to combat these activities, because anonymity makes perpetrators difficult to track down. Despite this association, the deep web has legitimate uses. For example, journalists and activists in repressive regimes can use it to communicate securely and to access information censored by the government. Additionally, many businesses keep sensitive data and conduct confidential transactions on systems that search engines cannot reach. Accessing the anonymous parts of the deep web can be challenging for the average internet user. Specialized software such as Tor (The Onion Router) is often required to navigate them safely and securely. Tor routes internet traffic through a series of encrypted relays, making it difficult for third parties to monitor users' online activities. While Tor provides a high level of security and privacy, it can be slow and cumbersome compared with traditional web browsers.
In conclusion, the deep web is a complex and multifaceted part of the internet that offers both opportunities and challenges. While it provides a valuable space for private communication and secure data storage, it also harbors illegal activities that pose significant risks to society. As technology continues to evolve, policymakers, law enforcement agencies, and internet users need to work together to address the unique challenges posed by the deep web and to ensure that it is used responsibly and ethically.

Project Overview

<p><strong>1.0 INTRODUCTION</strong></p><p>The volume of information on the Web is already vast and is increasing at a very fast rate, according to Deepweb.com [1]. The deep Web is a vast repository of web pages, usually generated by database-driven websites, that are available to web users yet hidden from traditional search engines. The crawlers these search engines use to traverse the Web (a crawler being a computer program that searches the Internet for newly accessible information to add to the index examined by a standard search tool [2]) cannot reach most of the pages created on the fly by dynamic sites such as e-commerce, news and major content sites, according to Deepweb.com [1].</p><p>According to a study by Bright Planet [3], the deep Web is estimated to be up to 550 times larger than the “surface Web” accessible through traditional search engines, and over 200,000 database-driven websites are affected (i.e. not accessible through traditional search engines). Sherman &amp; Price [4] estimate that the deep Web holds 3 to 4 times as many quality pages as are accessible through search engines such as Google, About and Yahoo.
While the actual figures are debatable, they make it clear that the deep Web is far bigger than the surface Web and is growing at a much faster pace, according to Deepweb.com [1].</p><p>In a simplified description, the Web consists of two parts: <strong>the surface Web</strong> and <strong>the deep Web</strong> (also called the invisible Web or hidden Web). The deep Web came into wide public awareness only with the publication of the landmark book by Sherman &amp; Price [4], “The Invisible Web: Uncovering Information Sources Search Engines Can’t See”. Since then, many books, papers and websites have emerged to help explore this vast landscape further, and they deserve the reader’s attention too.</p><ol><li><strong>Statement of Problem</strong></li></ol><p>Most people access Web content through surface search engines, yet an estimated 99% of Web content is not accessible through them.</p><p>A complete approach to conducting research on the Web uses both surface search engines and deep Web databases. However, while most Internet users are skilled in at least the elementary use of search engines, skill in accessing the deep Web is limited to a much smaller population. It is desirable for most users of the Web to be able to access most of its content. This work therefore seeks to address how the deep Web affects search engines, websites and searchers, and to proffer solutions.</p><ol><li><strong>Objective of the study</strong></li></ol><p>The broad objective of this study is to aid IT researchers in finding quality information in less time.
The specific objectives of the project work are as follows:</p><ol><li>To describe the deep Web and the surface Web</li><li>To compare the deep Web and the surface Web</li><li>To develop a piece of software that implements a deep Web search technique</li></ol><ol><li><strong>Significance of the study</strong></li></ol><p>The study of the deep Web is necessary because it brings into focus the problems encountered by search engines, websites and searchers. More importantly, the study provides information on the results of searches made using both surface search engines and deep Web search tools. Finally, it presents deep Web search not as a substitute for surface search engines but as a complement to them within a complete search approach that is highly relevant to academia and the general public.</p><ol><li><strong>Literature review</strong></li></ol><p><strong>What is Deep Web?</strong></p><p>Wikipedia [5] defines the <strong>surface Web</strong> (also known as the <strong>visible Web</strong> or <strong>indexable Web</strong>) as that portion of the World Wide Web that is indexed by conventional search engines. Search engines construct a database of the Web by using programs called spiders or Web crawlers that begin with a list of known Web pages. For each page it knows of, the spider retrieves the page and indexes it. Any hyperlinks to new pages are added to the list of pages to be crawled. Eventually all reachable pages are indexed, unless the spider runs out of time or disk space. The collection of reachable pages defines the surface Web.</p><p>For various reasons (e.g. the Robots Exclusion Standard, links generated by JavaScript and Flash, password protection), some pages cannot be reached by the spider.
These “invisible” pages are referred to as the <strong>deep Web</strong>.</p><p>Bergman [6] defines the <strong>deep Web</strong> (also known as the <strong>Deepnet</strong>, <strong>invisible Web</strong> or <strong>hidden Web</strong>) as World Wide Web content that is not part of the surface Web indexed by search engines. Dr. Jill Ellsworth coined the term “Invisible Web” in 1994 to refer to websites that are not registered with any search engine.</p><p>Sherman and Price [4] define the <strong>deep Web</strong> as text pages, files, or other often high-quality authoritative information available via the World Wide Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages. It is sometimes referred to as the “invisible Web” or “dark matter”.</p><p><strong>Origin of Deep Web</strong></p><p>In 1994, Dr. Jill H. Ellsworth, a university professor who was also an Internet consultant for Fortune 500 companies, was the first to coin the term “Invisible Web” [6]. In a January 1996 article, Ellsworth stated: “It would be a site that’s possibly reasonably designed, but they didn’t bother to register it with any of the search engines. So, no one can find them! You’re hidden.” She called that the <strong>invisible Web</strong>.</p><p>The first commercial deep Web tool (although its makers referred to it as the “Invisible Web”) was AT1 (@1) from Personal Library Software (PLS), announced on December 12th, 1996 in partnership with large content providers. According to the press release of that date, AT1 started with 5.7 terabytes of content, which was estimated to be 30 times the size of the nascent World Wide Web.</p><p>Another early use of the term “invisible Web” was by Bruce Mount (Director of Product Development) and Dr. Matthew B.
Koll (CEO/Founder) of Personal Library Software (PLS) when describing AT1 (@1) to the public. PLS was acquired by AOL in 1998 and AT1 (@1) was abandoned [7], [8].</p><p>AT1 was presented as an invisible Web service that let users find content “below, behind and inside the Net” and identify high-quality content amid multiple terabytes of data; top publishers joined as charter members.</p><ol><li><strong>The Internet and the Visible Web</strong></li></ol><p>The primary focus of this project work is on the Web and, more specifically, the parts of the Web that search engines cannot see (known as the invisible Web). To fully understand the phenomenon called the invisible Web, however, it is important first to understand the fundamental differences between the Internet and the Web.</p><p>Most people tend to use the words <strong>“Internet”</strong> and <strong>“Web”</strong> interchangeably, but they are not synonyms. The <strong>Internet</strong> is built on networking protocols (sets of rules) that allow computers of all types to connect to and communicate with other computers on the Internet. The Internet’s origins trace back to a project sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) in 1969 as a means for researchers and defense contractors to share information. The <strong>World Wide Web (Web)</strong>, on the other hand, is a software protocol that allows users to easily access files stored on Internet computers. The Web was created in 1990 by Tim Berners-Lee, a computer programmer working for the European Organization for Nuclear Research (CERN). Prior to the Web, accessing files on the Internet was a challenging task, requiring specialized knowledge and skills. The Web made it easy to retrieve a wide variety of files, including text, images, audio and video, by the simple mechanism of clicking a hypertext link.
Hypertext is a system that allows computerized objects (text, images, sounds, etc.) to be linked together, while a hypertext link points to a specific object, or to a specific place within a text; clicking the link opens the file associated with the object [4].</p><p>The Internet is a massive network of networks, a networking infrastructure. It connects millions of computers together globally, forming a network in which any computer can communicate with any other computer as long as both are connected to the Internet. Information that travels over the Internet does so via a variety of languages known as protocols.</p><p>The World Wide Web, or simply the Web, is a way of accessing information over the medium of the Internet. The Web uses the <strong>Hypertext Transfer Protocol</strong> (HTTP), one of the languages spoken over the Internet, to transmit data. The Web also relies on browsers to access Web documents, called Web pages, that are linked to each other via hyperlinks. Web documents can also contain graphics, sounds, text and video.</p><p>The Web is just one of the ways that information can be disseminated over the Internet. The Internet, not the Web, is also used for e-mail, which relies on the <strong>Simple Mail Transfer Protocol</strong> (SMTP, the standard Internet protocol for mail transfer and forwarding), as well as for Usenet newsgroups, instant messaging and the <strong>File Transfer Protocol</strong> (FTP, a network protocol used to transfer data from one computer to another through a network such as the Internet). So the Web is just a portion of the Internet, albeit a large one, and the two terms are not synonymous and should not be confused.</p><p><strong>1.4.2 How the Internet came to be</strong></p><p>Up until the mid-1960s, most computers were stand-alone machines that did not connect to or communicate with other computers. In 1962 J.C.R.
Licklider, a professor at the <strong>Massachusetts Institute of Technology</strong> (<strong>MIT</strong>, a private, coeducational research university located in Cambridge, Massachusetts), wrote a paper envisioning a globally connected “Galactic Network” of computers [4]. The idea was far-out at the time, but it caught the attention of Larry Roberts, a project manager at the U.S. Defense Department’s Advanced Research Projects Agency (ARPA). In 1966 Roberts submitted a proposal to ARPA that would allow the agency’s numerous and different computers to be connected in a network similar to Licklider’s Galactic Network.</p><p>Roberts’s proposal was accepted, and work began on the “ARPANET”, which would in time become what we know as today’s Internet. The first “node” on the ARPANET was installed at UCLA in 1969 and gradually, throughout the 1970s, universities and defense contractors working on ARPA projects began to connect to the ARPANET.</p><p>In 1973, the U.S. Defense Advanced Research Projects Agency (DARPA) initiated another research program to allow networked computers to communicate transparently across multiple linked networks. Whereas the ARPANET was just one network, the new project was designed to be a “network of networks”. According to Vint Cerf, widely regarded as one of the “fathers” of the Internet, the Internetting project and the system of networks that emerged from the research were known as the “Internet” [9].</p><p>It was not until the mid-1980s, with the simultaneous explosion in the use of personal computers and the widespread adoption of a universal standard of Internet communication called Transmission Control Protocol/Internet Protocol (TCP/IP), that the Internet became widely available to anyone desiring to connect to it.
Other government agencies fostered the growth of the Internet by contributing communications “backbones” that were specifically designed to carry Internet traffic. By the late 1980s, the Internet had grown from its initial network of a few computers to a robust communications network supported by government and commercial enterprises around the world.</p><p>Despite this increased acceptance, the Internet was still primarily a tool for academics and government contractors well into the early 1990s. As more and more computers were connected to the Internet, users began to demand tools that would allow them to search for and locate text and other files on computers anywhere on the net.</p><p><strong>1.4.3 Early Net Search Tools</strong></p><p>This section traces the development of the early Internet search tools and shows how their limitations ultimately spurred the popular acceptance of the Web. This historical background, while fascinating in its own right, lays the foundation for understanding why the invisible Web could arise in the first place.</p><p>Although sophisticated search and information retrieval techniques date back to the late 1950s and early ’60s, these techniques were used primarily in closed or proprietary systems. Early Internet search and retrieval tools lacked even the most basic capabilities, primarily because it was thought that traditional information retrieval techniques would not work on an open, unstructured information universe like the Internet.</p><p>Accessing a file on the Internet was a two-part process. First, you needed to establish a direct connection to the remote computer where the file was located, using a terminal emulation program called Telnet.
Telnet runs on your computer and allows you to access a remote computer via a TCP/IP network and execute commands on that computer as if you were directly connected to it. Many libraries offered Telnet access to their catalogs. Then you needed to use another program, called a File Transfer Protocol (FTP) client, to fetch the file itself. FTP is a set of rules for sending and receiving files of all types between computers connected to the Internet. For many years, to access a file it was necessary to know both the address of the computer and the exact location and name of the file you were looking for; there were no search engines or other file-finding tools like the ones we are familiar with today.</p><p>Thus “search” often meant sending a request for help to an e-mail message list or discussion forum and hoping some kind soul would respond with the details you needed to fetch the file you were looking for. The situation improved somewhat with the introduction of “anonymous” FTP servers, which were centralized file servers specifically intended to make sharing files easy. The servers were anonymous because they were not password protected; anyone could simply log on and request any file on the system.</p><p>Files on FTP servers were organized in hierarchical directories, much as files are organized in hierarchical folders on personal computer systems today.
The hierarchical structure made it easy for the FTP server to display a directory listing of all the files stored on the server, but you still needed good knowledge of the contents of the FTP server. If the file you were looking for did not exist on the FTP server you were logged into, you were out of luck.</p><p>The first true search tool for files stored on FTP servers was called Archie, created in 1990 by a small team of systems administrators and graduate students at McGill University in Montreal. Archie was the prototype of today’s search engines, but it was primitive and extremely limited compared to what we have today. Archie roamed the Internet searching for files available on anonymous FTP servers, downloading directory listings of every anonymous FTP server it could find. These listings were stored in a central, searchable database called the Internet Archives Database at McGill University, and were updated monthly.</p><p>Although it represented a major step forward, the Archie database was still extremely primitive, limiting searches to a specific file name or to computer programs that performed specific functions. Nonetheless, it proved extremely popular: nearly 50% of Internet traffic to Montreal in the early ’90s was Archie related, according to Deutsch [10], who headed up the McGill University Archie team.</p><p>“In the brief period following the release of Archie, there was an explosion of Internet-based research projects, including WWW, Gopher, WAIS, and others” [4]. Each explored a different area of the Internet information problem space, and each offered its own insights into how to build and deploy Internet-based services.
The team licensed Archie to others, with the first shadow sites launched in Australia and Finland in 1992. The Archie network reached a peak of 63 installations around the world by 1995.</p><p>Gopher, an alternative to Archie, was created by Mark McCahill and his team at the University of Minnesota in 1991 and was named for the university’s mascot, the Golden Gopher. Gopher essentially combined the Telnet and FTP protocols, allowing users to click hyperlinked menus to access information on demand without resorting to additional commands. Using a series of menus that allowed the user to drill down through successively more specific categories, users could ultimately access the full text of documents, graphics, and even music files, though not integrated in a single format. Gopher made it easy to browse for information on the Internet.</p><p>According to Gopher creator McCahill, “Before Gopher there wasn’t an easy way of having the sort of big distributed system where there were seamless pointers between stuff on one machine and another machine. You had to know the name of this machine and if you wanted to go over here you had to know its name.</p><p>“Gopher takes care of all that stuff for you. So navigating around Gopher is easy. It points and clicks typically. So it’s something that anybody could use to find things. It’s also very easy to put information up so a lot of people started running servers themselves and it was the first of the easy-to-use, no fuss, you can just crawl around and look for information tools. It was the one that wasn’t written for techies” [4].</p><p>Gopher’s “no muss, no fuss” interface was an early precursor of what later evolved into popular Web directories like Yahoo!.
“Typically you set this up so that you can start out with a sort of overview or general structure of a bunch of information, choose the items that you’re interested in to move into a more specialized area and then either look at items by browsing around and finding some documents or submitting searches” [4].</p><p>A problem with Gopher was that it was designed to provide a listing of files available on computers in a specific location, the University of Minnesota for example. While Gopher servers were searchable, there was no centralized directory for searching all the other computers that were both using Gopher and connected to the Internet, or “Gopherspace” as it was called. In November 1992, Fred Barrie and Steven Foster of the University of Nevada System Computing Services group solved this problem by creating a program called Veronica, a centralized Archie-like search tool for Gopher files. In 1993 another program, called Jughead, added keyword search and Boolean operator capabilities to Gopher search. A keyword is a word or phrase entered in a query form that a search system attempts to match in text documents in its database. Boolean logic is a system of logical operators (AND, OR, NOT) that allows true-false operations to be performed on search queries, potentially narrowing or expanding results when used with keywords.</p>
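
To make the keyword and Boolean definitions above concrete, the following is a minimal Python sketch of keyword matching combined with AND, OR and NOT. It is an illustration only and does not describe how Veronica or Jughead were actually implemented; the function names (tokenize, build_index, search) and the sample menu titles are hypothetical.

```python
import re


def tokenize(text):
    """Lower-case a text and return its set of word tokens (keywords)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, all_ids, must=(), any_of=(), exclude=()):
    """Apply AND over `must`, OR over `any_of`, and NOT over `exclude`."""
    results = set(all_ids)
    # AND: every required keyword narrows the result set.
    for word in must:
        results &= index.get(word.lower(), set())
    # OR: at least one of the optional keywords must match.
    if any_of:
        either = set()
        for word in any_of:
            either |= index.get(word.lower(), set())
        results &= either
    # NOT: drop every document containing an excluded keyword.
    for word in exclude:
        results -= index.get(word.lower(), set())
    return results


# Hypothetical menu titles standing in for the text documents in a database.
DOCUMENTS = {
    1: "Campus library catalog",
    2: "Library opening hours",
    3: "Campus weather report",
}
INDEX = build_index(DOCUMENTS)

if __name__ == "__main__":
    print(search(INDEX, DOCUMENTS, must=["campus", "library"]))   # AND
    print(search(INDEX, DOCUMENTS, any_of=["weather", "hours"]))  # OR
    print(search(INDEX, DOCUMENTS, must=["campus"], exclude=["library"]))  # NOT
```

In this sketch AND narrows the results, OR expands them, and NOT removes matches, which mirrors the narrowing and expanding behaviour described in the text.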
