Wednesday, April 27, 2022

Octoparse

Let’s focus on the Octoparse web-scraping tool, which helps us quickly fetch data from any website without any coding; anyone can use this tool to build a crawler in just minutes, as long as the data is visible on the web page. In short, I would call it a “no-code (or low-code) web-scraping tool.” Building and maintaining scrapers by hand takes substantial time and skill, and since most companies are busy running their business, low-code web-scraping tools are often the better choice for improving the productivity of data-related work.

Ultimately, the primary reason is always that it saves time, across all industries. Certainly, everyone can take advantage of the interactive workflow and the intuitive tips guide to build their own scrapers.

Octoparse can fulfil most data extraction requirements and scrape data from different kinds of websites, such as e-commerce, social media, and structured or tabulated pages. It is capable of supporting use cases like price monitoring, social trend discovery, risk management, and many more.

It offers many features; let’s discuss a few of the major ones in this article.

Hardware and Software Requirements

To run Octoparse on your system and to use the easy web-scraping workflow, your system only needs to fulfil the following requirements:

Operating Systems:

  • Win7/Win8/Win8.1/Win10(x64)
  • Mac users can download the Mac version of Octoparse directly from the website.

Software

  • Microsoft .NET Framework 3.5.

Internet Access

Environment of Octoparse

Let’s discuss the Octoparse environment. The Workspace is the place where we build our tasks. It has four parts, each serving a particular purpose.

  • The Built-in Browser: Once you’ve entered a target URL, the web page is loaded in Octoparse’s built-in browser. You can browse any website in Browse mode, or click to extract the data you need in Select mode.
  • The Workflow: To interact with the webpage(s), such as opening a web page or clicking on page elements, the entire process is defined automatically in the form of a workflow.
  • Tips Box: Octoparse uses smart tips to “talk” to you during the extraction process and guide you through the task-building process.
  • Data Preview: You can preview the selected data. It also provides options to rename data fields or remove items you don’t need.

The Octoparse installation package can be downloaded from the official website.

How does Octoparse Work?

Octoparse extracts web page data automatically: it opens a web page and clicks on it the way a human browsing would, then extracts the data through a well-defined workflow in which each action serves the target and objective of the task.

  • Simulation
  • Workflow
  • Extraction

Understand the Octoparse Interface

Since Octoparse provides a very rich and user-friendly interface, anyone can extract data from any web page. For most tasks, I would recommend the task templates, which can get the job done in a few minutes. They cover data for analysis across various categories, such as Products, Travel, Social Media, Search Engines, Jobs, Real Estate, and Finance.

The main tabs are New, Dashboard, Data Services, Tools, and Tutorials.

Let’s explore each item quickly.

Home Page

 

Octoparse Home Page

Workflow: The workflow-based design has been put in place so that Octoparse can be operated entirely within the GUI. Scripts or manually inserted code are possible in places, but not necessary. The Octoparse workflow can be built in two ways: Advanced Mode and Template Mode.

Workflow

 

Task Templates: These are pre-built tasks that fetch data when you enter simple parameters such as URL(s) or keywords. There are over 60 templates covering most mainstream websites. There is no need to build anything yourself and no technical challenges. Simply select the template you need, check the sample data to see whether it returns what you want, and extract the data right away.

Task Templates

You can also drill into groups of templates specific to a country and, based on those, extract data for your analysis use cases.

Task Templates Image 3

 

Templates Configuration

  • Data preview: The data preview tab lists the items extracted during the process.
  • Parameter: The parameter tab is where you provide the URL(s) to run your data extraction against.
  • Sample: This tab shows the extracted data in tabular format.
  • Advanced Mode: Use Advanced Mode to take full control of every step of your web-scraping project.

Description

Templates Configuration
Templates Configuration Image 2

Extracted Data

Extracted Data

After a successful run, all the data is available in the tool and ready for further analysis.

Advanced Mode

Advanced Mode is a far more flexible and powerful web-scraping mode than Task Templates. It is aimed at people who want to scrape websites with complex structures for their specific projects.

The main features of Advanced Mode:

  • Scrape data from almost any kind of web page
  • Extract data in different formats: URL, image, and HTML
  • Design a workflow that interacts with a web page, such as logging in or searching by keyword
  • Customize your workflow, such as setting up wait times, modifying XPath, and reformatting the extracted data
Advanced Mode

There is a provision to edit or create your own advanced workflow, which you can explore within the tool. Here you can simulate real human browsing actions, such as the steps below.

  • Opening a web page
  • Clicking on a page element or button to extract data automatically.

The whole extraction process is defined automatically in a workflow with each step representing a particular instruction in the scraping task.

Dashboard: Here you can manage all your scraping tasks: rename, edit, delete, and organize them. You can also conveniently schedule any task.

Dashboard

Extraction with the Octoparse Cloud

One of the nicest features of Octoparse is its powerful cloud platform, where users can run their tasks 24/7. When you run a task with the “Cloud Extraction” option, it runs in the cloud on multiple servers using different IPs. You can shut down the app or your computer while the task is running, and you need not worry about hardware and its limitations.

The data extracted during this process is saved in the cloud itself and can be accessed at any time. You can also schedule your task to run as frequently as you need.

Octoparse cloud
Octoparse cloud Image 1

 

Octoparse cloud Image 2

Auto-data export: The tool provides an auto-data export feature that can push data to a database, and it can be automated and scheduled. There are multiple options to configure this feature and fine-tune how the data is exported.

Auto-data export

The tool also lets you refine your data with tasks such as the following:

  • Rename/move/duplicate/delete a field
  • Clean data
  • Capture HTML code
  • Extract page level-based data and date & time

You should also know about the anti-blocking settings available in this tool. A few of them are described below, and you can easily add them to your workflow settings.

  • IP Blocking:
    • Some websites are very sensitive to web scraping and take serious anti-scraping measures, such as blocking IPs, to stop scraping activity. Running tasks with Cloud Extraction, which uses multiple servers with different IPs, reduces this risk.
  • Browser Recognition & Cookie Tracking:
    • Websites can recognise and eventually block a crawler from the user-agent sent with every browser request. With Octoparse you can easily enable automatic user-agent rotation in your crawler to reduce the risk of being blocked (a plain-Python sketch of the idea follows below).
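Octoparse handles user-agent rotation from its GUI; purely for illustration, here is a minimal, hypothetical Python sketch (not Octoparse code) of what rotating user agents looks like if you make the requests yourself. The URL and the user-agent strings are placeholders, and the third-party requests package is assumed to be installed.

import random
import requests  # third-party HTTP client

# Hypothetical pool of user-agent strings to rotate through (placeholders).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:110.0) Gecko/20100101 Firefox/110.0",
]

def fetch(url):
    """Fetch a page while sending a randomly chosen user-agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text

# Example usage (placeholder URL):
# html = fetch("https://example.com")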

Data Services: This is where you, together with a team of web scraping experts, can have the whole web scraping process built and customized for your needs.

Data Services

Conclusion

Guys, so far we have explored what web scraping and crawling are, the scope of both techniques, and their significance during the data preparation stage. We then focused on the Octoparse tool and its key features: its hardware and software requirements, the Octoparse environment, how Octoparse works, the Octoparse interface, and the key components Workflow, Dashboard, and Data Services. Extraction with the Octoparse Cloud is in high demand, and the auto-data export and anti-blocking features in particular are major milestones in the process. Undoubtedly, this tool can fulfil most data extraction requirements across different websites, and above all it saves time. Since the tool supports over 60 predefined templates for most mainstream websites, our job becomes very simple. I hope you got a high-level picture of the Octoparse tool and its benefits; you can install it and explore more. Thanks for your time on this web scraping article.


Tuesday, April 26, 2022

Web-Scraping

Web scraping is the process of extracting diverse volumes of data (content) from a website into a standard format, slicing and dicing it as part of data collection from a Data Analytics and Data Science perspective, and saving it as flat files (.csv, .json, etc.) or storing it in a database. The scraped data will usually be in a spreadsheet or tabular format. It is also called web data extraction, web harvesting, screen scraping, and so on.

Is this Legally accepted?

As long as you use the data ethically, this is absolutely fine. In any case, we are mostly using data that is already publicly available; however, websites that wish to prevent web scraping can employ techniques like CAPTCHA forms and IP banning.


Crawler & Scraper

Let’s understand Crawler & Scraper:

What is Web Crawling?

In simple terms, web crawling is the process of indexing the expected business data on target web pages using a well-defined program or automated script that follows the business rules. The main objective of a crawler is to learn what the target web pages are about and to retrieve information from one or more pages based on your needs. These programs (in Python, R, Java, etc.) or automated scripts are called web crawlers or spiders, usually just crawlers.

What is Web Scraper?

This is the most common technique for data collection during the data preparation stage of Data Science projects: a well-defined program extracts valuable information from a target website into a human-readable output format, and it can be written in any language.
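Purely as an illustration (not part of the original post), here is a minimal Python sketch of such a scraper using the well-known requests and beautifulsoup4 packages; the URL and the CSS selector are hypothetical placeholders you would replace for your own target site.

import csv

import requests                # third-party HTTP client
from bs4 import BeautifulSoup  # third-party HTML parser (beautifulsoup4)

URL = "https://example.com/products"  # placeholder target page

# Download the page and parse the HTML.
html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Extract every product title (hypothetical CSS class) into a list of rows.
rows = [{"title": tag.get_text(strip=True)} for tag in soup.select(".product-title")]

# Save the scraped data as a flat CSV file, as described above.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title"])
    writer.writeheader()
    writer.writerows(rows)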

Scope of Crawling and Scraping

Data Crawling:

  • It can be done at any scale
  • It involves downloading the web pages it references
  • It requires a crawl agent
  • It is the process of using bots to read and store content
  • It goes through every single page of the specified website.
  • Deduplication is an essential part (see the crawler sketch after these lists)

Data Scraping:

  • It extracts data from multiple sources
  • It focuses on a specific set of data on a web page
  • It can be done at any scale
  • It requires both a crawler and a parser
  • Deduplication is not an essential part.
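To make the comparison concrete, here is a minimal, hypothetical Python crawler sketch (not from the original post) that walks every page it can reach on one site and uses a visited set for deduplication; the start URL is a placeholder, and requests and BeautifulSoup are again assumed to be installed.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://example.com/"  # placeholder start page
DOMAIN = urlparse(START_URL).netloc

visited = set()            # deduplication: never fetch the same URL twice
queue = deque([START_URL])

while queue and len(visited) < 50:   # small cap keeps the sketch polite
    url = queue.popleft()
    if url in visited:
        continue
    visited.add(url)

    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Crawling goes page by page: follow every link that stays on the same site.
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == DOMAIN and link not in visited:
            queue.append(link)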

In the next post we'll focus on the Octoparse Web Scraping tool.

Saturday, April 23, 2022

Reading the Data using Spark

Let's continue from where we left off in the previous post.

df_spark = spark_session.read.csv('sample_data/california_housing_train.csv')

In the code above, Spark uses the Spark session variable to call the read.csv() function to read the data when it is in CSV format. If you remember, when we need to read a CSV file in pandas we call read_csv().

In my view, when we are learning something new that builds on previous learning, it is good to compare the two, so in this article we will also compare pandas’ data processing with Spark’s.

df_spark

Output:

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string]

Looking at the output, we can see that it returned a DataFrame with a dictionary-like layout, in which (_c0, _c1, _c2, ..., _cn) are the column labels and next to each column index we can see its type, i.e. string.

PySpark’s show() function

df_spark.show()

Output:


From the above output, we can compare PySpark’s show() function with pandas’ head() function.

  • head() shows the top 5 records (unless we specify otherwise in the arguments), whereas show() returns the top 20 records, as also noted at the end of its output.
  • Another difference is the appearance of the tabular output produced by the two functions; you can compare it with the head() output used at the start of the article.
  • One more major difference, which is also a drawback of PySpark’s show() here, is that the column names appear as (_c0, _c1, _c2, ..., _cn) and the real column names show up as the first row of records; but we can fix this issue as well.

So let’s fix it!

df_spark_col  = spark_session.read.option('header', 'true').csv('sample_data/california_housing_train.csv')

df_spark_col

Output:

DataFrame[longitude: string, latitude: string, housing_median_age: string, total_rooms: string, total_bedrooms: string, population: string, households: string, median_income: string, median_house_value: string]

Okay! Just by looking at the output we can say that we have fixed that problem (we will still confirm it below), because instead of (_c0, _c1, _c2, ..., _cn) the output now shows the actual names of the columns.

Let’s confirm it by looking at the complete data with records.

df_spark = spark_session.read.option('header', 'true').csv('sample_data/california_housing_train.csv')

df_spark.show()

Output

Now, in place of the index-style labels, we can see the actual names of the columns.
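One more thing worth noting: the columns are still read as strings, as the earlier DataFrame output showed. A minimal sketch, assuming the same file path, adds the inferSchema option so that Spark detects numeric column types automatically; printSchema() then confirms the result.

# Assumption: same sample file path as above; inferSchema asks Spark to detect types.
df_typed = (
    spark_session.read
    .option('header', 'true')
    .option('inferSchema', 'true')
    .csv('sample_data/california_housing_train.csv')
)

df_typed.printSchema()  # columns such as longitude/latitude now come back as double
df_typed.show(5)        # show only the first 5 rows, similar to pandas head()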



Friday, April 22, 2022

PySpark using Python

Apache Spark is an engine that helps operate and execute data analysis, data engineering, and machine learning tasks both in the cloud and on a local machine, and for that it can use either a single machine or a cluster, i.e. a distributed system.


There are already several relevant tools on the market that can perform data engineering tasks, so in this section we will discuss why we should choose Apache Spark over its alternatives.


Features of Spark


  1. Streaming data: Streaming here happens in (micro-)batches, and with this key feature Apache Spark can stream our data in near real-time using our preferred programming language (see the sketch after this list).
  2. Increasing data science scalability: Apache Spark is one of the most widely used engines for scalable computing, and for data science tasks that require high computational power it should be the first choice.
  3. Handling big data projects: As previously mentioned, it has high computational power, so it can also handle big data projects in the cloud using distributed systems/clusters rather than local machines.
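Purely to illustrate the streaming feature (this example is not from the original post), here is a minimal PySpark Structured Streaming sketch using the built-in "rate" source, which generates rows continuously; it assumes PySpark is installed, as covered in the next section.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('streaming_sketch').getOrCreate()

# The built-in "rate" source emits timestamped rows continuously; handy for demos.
stream_df = spark.readStream.format('rate').option('rowsPerSecond', 5).load()

# Print the incoming micro-batches to the console for about ten seconds, then stop.
query = stream_df.writeStream.format('console').outputMode('append').start()
query.awaitTermination(10)
query.stop()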
Installing PySpark Library using “pip”

pip install pyspark

Output:



After successfully installing PySpark


Importing PySpark Library

import pyspark

Reading the Dataset

Just before reading the dataset, let me tell you that we will be working with the California House Price dataset; since I worked in Google Colab for the PySpark operations, I picked this dataset up from its sample_data section.

import pandas as pd
data = pd.read_csv('sample_data/california_housing_test.csv')
data.head()

Output:


Now that we have imported the dataset and had a look at it, let's start working with PySpark. But before doing any real work with PySpark we have to start a Spark session, and for that we need to follow the steps mentioned below.

  1. Import SparkSession from PySpark's SQL module.
  2. Build the Spark session using the builder of the SparkSession class.
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName('PySpark_article').getOrCreate()

As we can see, with the help of the builder we first called appName to name our session (here I gave "PySpark_article" as the session name), and at the end we called getOrCreate() to create the session and stored it in the variable named spark_session.

spark_session

Output:


When we look at what our Spark session holds, it returns the above output, which has the following components:

  1. About the Spark session: in memory
  2. Spark context:
    • Version: the current version of Spark we are using, v3.2.1
    • Master: the interesting thing to notice here is that when we work in the cloud we might have several clusters, with a master and then a tree-like structure (cluster_1, cluster_2, ..., cluster_n); but since we are working on a local system and not a distributed one, it returns local (see the sketch below).
    • AppName: finally, the name of the app (Spark session) that we gave while declaring it.
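If you ever want to control this explicitly, a short sketch (an assumption on my part, not from the original post) shows how the master can be set while building the session; 'local[*]' simply means "use all local cores", while on a cluster you would pass the cluster manager's URL instead.

from pyspark.sql import SparkSession

# 'local[*]' runs Spark locally with as many worker threads as there are cores.
# Note: in a fresh Python process this creates a session bound to local[*];
# if a session already exists, getOrCreate() simply reuses it.
spark_local = (
    SparkSession.builder
    .appName('PySpark_article')
    .master('local[*]')
    .getOrCreate()
)

print(spark_local.sparkContext.master)  # prints: local[*]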
Next we'll see how to read data using Spark.

Thursday, April 21, 2022

Qubits

The qubit is short for “quantum bit.” While a bit can only be 0 or 1, a qubit can exist in more states. Qubits are surprising, fascinating, and powerful. They follow strange rules which may not initially seem natural to you. According to physics, these rules may be how nature itself works at the level of electrons and photons.

A qubit starts in an initial state. We use the notation |0⟩ and |1⟩ when we are talking about a qubit instead of the 0 and 1 for a bit. For a bit, the only non-trivial operation you can perform is switching 0 to 1 and vice versa. We can move a qubit’s state to any point on the sphere shown in the center of Figure 1.9. We can represent more information and have more room in which to work.


This sphere is called the Bloch sphere, named after physicist Felix Bloch. Things get even better when we have multiple qubits. One qubit holds two pieces of information, and two qubits hold four. That’s not surprising, but if we add a third qubit, we can represent eight pieces of information. Every time we add a qubit, we double its capacity. For 10 qubits, that’s 1,024. For 100 qubits, we can represent 1,267,650,600,228,229,401,496,703,205,376 pieces of information. This illustrates exponential behavior, since we are looking at 2 raised to the power of the number of qubits.
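To make the doubling concrete, here is a quick Python check (added for illustration only) that reproduces the numbers quoted above.

# Each added qubit doubles the capacity: 2 ** n for n qubits.
for n in (1, 2, 3, 10, 100):
    print(n, 'qubits ->', 2 ** n)

# 10 qubits -> 1024
# 100 qubits -> 1267650600228229401496703205376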

Some qubit features:

  • While we can perform operations and change the state of a qubit, the moment we look at the qubit, the state collapses to 0 or 1. We call the operation that moves the qubit state to one of the two bit states “measurement.”
  • Just as we saw that bits have meaning when they are parts of numbers and strings, presumably the measured qubit values 0 and 1 have meaning as data.
  • Probability is involved in determining whether we get 0 or 1 at the end (the state written after this list makes this precise).
  • We use qubits in algorithms to take advantage of their exponential scaling and other underlying mathematics. With these, we hope eventually to solve some significant but currently intractable problems. These are problems for which classical systems alone will never have enough processing power, memory, or accuracy.
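
As a standard piece of notation (not taken from the original text), a single-qubit state and its measurement probabilities can be written as:

|ψ⟩ = α|0⟩ + β|1⟩, with |α|² + |β|² = 1.

Measuring |ψ⟩ gives 0 with probability |α|² and 1 with probability |β|², and the state then collapses to |0⟩ or |1⟩ accordingly.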

Scientists and developers are now creating quantum algorithms for use cases in financial services and artificial intelligence (AI). They are also looking at precisely simulating physical systems involving chemistry. These may have future applications in materials science, agriculture, energy, and healthcare.
