How to mine and fetch data using Rust

When a site doesn't offer an API for the data you need, you can always dig into the HTML, and Rust can help you do the web fetching.

Web scraping, or web data mining, is a popular, fast and effective technique for gathering large amounts of data from web pages. When no API is available, web scraping might be the best approach.

Rust's speed and memory safety make it well suited to building web data miners. Its ecosystem includes several capable parsing and data extraction libraries, and its robust error handling helps make web crawling efficient and reliable.

Web Data Mining in Rust

Several popular libraries support web data mining in Rust, including reqwest, scraper, select, and html5ever. Many Rust developers combine reqwest and scraper for their web mining.

The reqwest library provides the functionality to make HTTP requests to web servers. It is built on the hyper crate and offers a high-level API for standard HTTP features.

Scraper is a web mining library that parses HTML documents and extracts data using CSS selectors.

After creating a new Rust project with the cargo new command, add the reqwest and scraper crates to the dependencies section of your Cargo.toml file:

[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.12.0"

You will use reqwest to send HTTP requests and scraper for parsing.

Retrieve web pages using Reqwest

You need to request a page's content before you can parse it and extract specific data.

You can send a GET request with reqwest's get function and read the HTML source of the page by calling the text function on the response:

use reqwest::blocking::get;

fn retrieve_html() -> String {
    // send a GET request and read the response body as text
    let response = get("https://news.ycombinator.com").unwrap().text().unwrap();
    response
}

The get function sends a request to the web page, and the text function returns the HTML content of the response.

Parsing HTML with Scraper

The retrieve_html function returns the HTML content as a string. You need to parse it to extract the data you want.

Scraper exposes its HTML handling through the Html and Selector types. Html parses the document, while Selector selects specific elements in the parsed HTML.

Here's how you can extract all the titles on a page:

use scraper::{Html, Selector};

fn main() {
    let response = reqwest::blocking::get("https://news.ycombinator.com/")
        .unwrap()
        .text()
        .unwrap();

    // parse the HTML document
    let doc_body = Html::parse_document(&response);

    // select the elements with the titleline class
    let title = Selector::parse(".titleline").unwrap();

    for title in doc_body.select(&title) {
        // collect the element's text nodes and print the first one
        let titles = title.text().collect::<Vec<_>>();
        println!("{}", titles[0])
    }
}

Html's parse_document function parses the HTML content, and Selector's parse function builds a selector from the given CSS selector string (here, the titleline class). Calling select on the document then returns the matching elements.

The for loop iterates over these elements and prints the first block of text from each element.

Here are the results:

[Picture 2: How to mine and fetch data using Rust]

Select attributes with Scraper

To extract an attribute value, select the required elements and call the attr method on the value returned by each element's value method:

use reqwest::blocking::get;
use scraper::{Html, Selector};

fn main() {
    let response = get("https://news.ycombinator.com").unwrap().text().unwrap();
    let html_doc = Html::parse_document(&response);
    let class_selector = Selector::parse(".titleline").unwrap();

    for element in html_doc.select(&class_selector) {
        let link_selector = Selector::parse("a").unwrap();

        for link in element.select(&link_selector) {
            if let Some(href) = link.value().attr("href") {
                println!("{}", href);
            }
        }
    }
}

After the parse function builds a selector for the titleline class, the outer for loop iterates over the matching elements. Inside the loop, the code selects the a tags and reads each href attribute with the attr method.

The main function prints these links with the following result:

[Picture 3: How to mine and fetch data using Rust]

That is how to mine and fetch web data using Rust. Hopefully this article is useful to you.
