Google still dominates the online search market.
Challenge 1: Web 2.0 data
Needless to say, the Web 2.0 boom brought a wave of users creating content on the Internet. This explosion poses a major data-processing challenge for search engine systems.
Over roughly the past five years, the volume of data has grown dramatically. It lives on forums, blogs, wikis, social networks, multimedia services, and so on, alongside the huge amount of junk data created every day. With distributed storage and processing solutions, today's search engines handle this load quite well. However, data is not just growing non-stop, it is growing faster and faster. With it, the costs of bandwidth, storage hardware, software capacity, energy, and everything else needed to keep data centers running also rise, posing a cost challenge: the cost of developing and the cost of operating a search engine system.
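To make "distributed processing" concrete, here is a minimal sketch of the map/reduce pattern that large search systems use to spread work such as word counting across many machines; the function names are invented for this illustration.

    from collections import defaultdict

    def map_phase(document):
        # Emit a (word, 1) pair for every word in one document.
        # In a real system each mapper runs on a different machine.
        return [(word.lower(), 1) for word in document.split()]

    def reduce_phase(pairs):
        # Sum the counts for each word across all mappers.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return counts

    documents = ["the web grows", "the web never sleeps"]
    pairs = [p for doc in documents for p in map_phase(doc)]
    print(reduce_phase(pairs))  # counts: 'the' -> 2, 'web' -> 2, ...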
Not every administrator wants Google to index their website, and many sites require users to register as members before they can view content. Moreover, every crawler visit costs the site bandwidth, and crawler traffic can far exceed the number of human visitors per day.
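The standard way an administrator signals "do not index me" is the robots.txt convention, which a polite crawler checks before fetching anything. A minimal sketch using Python's standard library; example.com and the crawler name are placeholders.

    from urllib import robotparser

    # Load the site's robots.txt, which lists paths crawlers may not fetch.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/members/private-thread.html"
    if rp.can_fetch("MyCrawler", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt forbids fetching", url)  # a polite crawler stops here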
As Web 2.0 services boomed, more and more personal data came into existence on the Internet. It may be protected by privacy settings, or hard to query by following ordinary links. Moreover, many Web 2.0 sites use scripts to generate URLs, or use HTTP POST to query data. So how can a search engine scan all the data on the Internet? This is a hard challenge for the search engine's crawler. Of course, search engines faced the Invisible Web long before Web 2.0, but once data-driven services exploded, the Invisible Web became larger and more complicated.
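To see why POST-driven data stays invisible: the content below is reachable only by submitting a form, so no hyperlink ever points at it for a crawler to follow. A sketch using the third-party requests library, with a hypothetical endpoint and form fields.

    import requests

    # A link-following crawler only ever issues GETs on URLs it has seen.
    # This listing exists only behind a form submission (HTTP POST), so
    # no hyperlink anywhere on the site points at it.
    response = requests.post(
        "https://example.com/search",                # hypothetical endpoint
        data={"category": "jobs", "city": "Hanoi"},  # hypothetical form fields
    )
    print(response.status_code)
    # The crawler never sees this page unless it learns to fill in forms.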
Updating data in real time is also a need. An article posted on CNN can be found through Google almost as soon as it is published. Real-time updates are not difficult with a small amount of data, but search engines must handle billions of items of every kind each day. This is a challenge for indexing in search engine systems.
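Real-time indexing comes down to folding new documents into the index the moment they arrive. A toy in-memory inverted index, just to show the data structure; real systems shard it across many machines.

    from collections import defaultdict

    # Inverted index: word -> set of document ids that contain it.
    index = defaultdict(set)

    def add_document(doc_id, text):
        # Called the moment a new article arrives; at web scale this
        # happens billions of times a day, which is the hard part.
        for word in text.lower().split():
            index[word].add(doc_id)

    add_document("cnn-001", "H1N1 flu spreads in Mexico")
    add_document("blog-042", "my trip to Mexico")
    print(index["mexico"])  # {'cnn-001', 'blog-042'}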
Challenge 2: Filtering out junk and duplicate data
As mentioned above, storage costs must be kept as low as possible, so the less junk and duplicate data stored, the better. Such data also hurts access speed and the quality of the results returned to users. Just as spam plagues email, junk data plagues search engines. Junk and duplicate data include the following (a de-duplication sketch follows the list):
- The same data is reachable through many different links within a website, or changes on every visit because of session management.
- Data is duplicated by people or spread automatically by software.
- SEO (Search Engine Optimization, tricks used to push pages into high-ranking positions) and keyword-tag stuffing confuse the engine when it evaluates the value of data.
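One common way to catch duplicated data like the cases above is shingling: compare the sets of short word sequences that two pages share. A minimal sketch; the 0.8 threshold is an arbitrary choice for illustration.

    def shingles(text, k=3):
        # All k-word sequences ("shingles") in the text.
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        # Overlap between two shingle sets, from 0.0 to 1.0.
        return len(a & b) / len(a | b) if a | b else 0.0

    page1 = "user posted the same announcement about cheap nokia phones for sale today"
    page2 = "user posted the same announcement about cheap nokia phones for sale yesterday"
    if jaccard(shingles(page1), shingles(page2)) > 0.8:
        print("near-duplicates: keep one, drop the other")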
For example, many Vietnamese websites that run vBulletin forums for their communities have been hit by spam tools from Russia, leaving streams of unsavory content on them. In another case, Yahoo Search provides an API for metadata search, yet Google indexes those pages, and the results can be returned when users query Google Search.
Will modern search engines be good enough to handle all of these problems?
Challenge 3: Vertical search and data query patterns
With a single keyword, we can get back hundreds of millions of web pages that contain it. But we really don't need that many. For example, when searching for the word Nokia, I may want results only from sites selling used Nokia phones, not news articles or store addresses for new Nokia phones. Search "skills" let users filter the returned results themselves, but the future will not be so easy: billions upon billions of pages may come back, and every skill becomes useless against too much data.
Therefore, a search engine must know how to localize data; more specifically, to break the data into domains that narrow the scope of a query, helping users reach results faster and more accurately. In addition, certain kinds of data, such as news, commodity prices, stock quotes, and job listings, need to be refreshed much more frequently than discussions on forums or blogs. Partitioning the data applies to crawling, storage, and search alike.
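A sketch of what breaking down the data range might look like: each query is routed to a vertical index rather than to one undifferentiated pile of pages. The verticals and documents here are invented for illustration.

    # Each vertical holds only documents of its own kind, so a query
    # scoped to "used-phones" never touches news or job data.
    vertical_indexes = {
        "news":        {"nokia cuts prices", "nokia opens new store"},
        "used-phones": {"used nokia 3310 for sale", "secondhand nokia e71"},
        "jobs":        {"nokia is hiring engineers"},
    }

    def vertical_search(vertical, keyword):
        docs = vertical_indexes.get(vertical, set())
        return [d for d in docs if keyword in d]

    # The user who wants second-hand Nokias searches only that vertical.
    print(vertical_search("used-phones", "nokia"))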
The old-generation search engine: "Give me a keyword, and I will give you the pages that contain it."
The modern user: "Give me interesting results, please, not all of them!"
Oh, yes. Smart users demand more flexible ways of querying data, not just search keywords. Filtering queries help search engines return better results: queries filtered by time, field, place, person, and so on, and especially by the user's locale. Eastern culture is very different from Western culture; East and West carry different social norms, which shape how people look for information. The search engine must therefore be smart enough to satisfy requests from every part of the world.
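Filtered queries of this kind can be modeled as a keyword plus structured facets such as time and place. A minimal sketch with invented documents and field names.

    from datetime import date

    documents = [
        {"title": "nokia store opens", "place": "Hanoi",  "date": date(2009, 5, 1)},
        {"title": "used nokia 3310",   "place": "Saigon", "date": date(2009, 6, 9)},
    ]

    def search(keyword, place=None, since=None):
        # Keyword match first, then apply each facet the user supplied.
        hits = [d for d in documents if keyword in d["title"]]
        if place is not None:
            hits = [d for d in hits if d["place"] == place]
        if since is not None:
            hits = [d for d in hits if d["date"] >= since]
        return hits

    print(search("nokia", place="Saigon", since=date(2009, 6, 1)))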
Challenge 4: Adding semantics
Semantics here means, first of all, two things:
- Adding related content to search results.
- Extracting ideas or summarizing content to help users access or review it faster.
The Internet is like a spider web, both physically and at the level of websites: links bind pages together, weaving views that point at one another into a network. Think a little further and the data on the Internet behaves the same way. Take the H1N1 flu: content about it exists on blogs, forums, news sites, and every other kind of medium. Users can create links so the pages point at each other, but if the search engine knows how to assemble and link them itself, it becomes far more effective. Such data links let a search engine return pages containing the phrase "Mexican flu" even though the user typed only "H1N1".
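The H1N1 example comes down to query expansion: the engine keeps links between equivalent terms and searches for all of them. A toy sketch with a hand-written synonym table.

    # Terms the engine has learned to treat as the same topic.
    synonyms = {
        "h1n1": {"h1n1", "mexican flu", "swine flu"},
    }

    def expand(query):
        # Return the query itself plus every linked term.
        return synonyms.get(query.lower(), {query.lower()})

    pages = ["mexican flu reaches europe", "stock market update"]
    for term in expand("H1N1"):
        for page in pages:
            if term in page:
                print("match:", page)  # found via "mexican flu", not "h1n1"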
Extracting information or summarizing content is also a need in search engines. For example, when searching resumes, users want quick access to details such as salary and required years of experience. Likewise, in classified ads, the price, phone number, contact address, and time of sale are what users need. Solving this problem, a search engine saves users a great deal of time. And time is money.
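Pulling structured fields out of free text is classic information extraction. A rough sketch that lifts salary and years of experience from a snippet with regular expressions; the patterns are deliberately naive and only illustrative.

    import re

    snippet = "Senior developer, 5 years of experience, salary $2000/month, Hanoi."

    # Very naive patterns; real extractors use far richer models.
    salary = re.search(r"salary\s*\$?(\d+)", snippet)
    years  = re.search(r"(\d+)\s*years? of experience", snippet)

    print("salary:", salary.group(1) if salary else "unknown")
    print("experience:", years.group(1) if years else "unknown")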
As for data mining: with small amounts of data, current technology handles the requirement well enough, but an enormous data set is still a challenge. The theory of data mining was laid down long ago, yet its application in today's search systems is limited or has not yet shown its value. Hopefully, in the near future, users will get to enjoy this capability.
Challenge 5: Search engines don't just index the web
Users are getting used to throwing everything at a search engine. For example, I might want it to solve a cubic equation and plot the graph, or to price housing in the center of District 3, Ho Chi Minh City. Google embraced this idea from its early days; try typing "1 + 1" into Google search and see.
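Answering "1 + 1" is easy; solving a cubic is the same idea scaled up, a computation engine behind the search box. A sketch using the third-party sympy symbolic-math library, assumed to be installed.

    from sympy import symbols, solve

    x = symbols("x")

    # The kind of query a computational search engine could answer directly:
    # solve the cubic x^3 - 6x^2 + 11x - 6 = 0.
    roots = solve(x**3 - 6*x**2 + 11*x - 6, x)
    print(roots)  # [1, 2, 3]

    # And the trivial case that Google already handles:
    print(1 + 1)  # 2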
However, Google has not, or has not yet, pushed far in that direction. Wolfram Alpha was born as a complement. Hold on, though: do not try to pit Wolfram against Google. Instead, imagine combining the two into a service that can answer everything. The search engine would then become a great brain, able to calculate, reason, and remember a huge amount of information for humanity.
Would you want to use such a service? Is it too idealistic? Just wait: in the future, Bing.com, Google.com, or Wolframalpha.com may become exactly such a search engine.
Notes:
- Crawler: a software program that scans web pages to index them for search.
- Indexing: organizing content by keywords so that it can be searched by keyword.
- Invisible Web: hidden web data that cannot be queried through normal addresses or links; it is reachable only through form submission or other methods that conceal the link.
- Data mining: a field of computing concerned with analyzing the semantics of data.