Data Scraping in Rails by Processing CSV

This Ruby on Rails application scrapes the links listed in an uploaded CSV file and counts how often a given link occurs on each scraped page.

In the application, the user uploads a CSV file and supplies the email addresses of the users to whom the parsed CSV will be sent.

The CSV has two columns:

  • refferal_link
  • home_link

Each row supplies a value for both columns: the page to scrape (refferal_link) and the link to look for on that page (home_link).
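
As a rough illustration of how such a file is read, the sketch below parses a small in-memory CSV with Ruby's standard csv library. The two URLs are made-up sample values, not data from the original application. The header spelling refferal_link is kept because the job later reads that exact key.

```ruby
require "csv"

# Hypothetical sample data; the URLs are placeholders.
SAMPLE_CSV = <<~TEXT
  refferal_link,home_link
  https://example.com/blog,https://example.org/
  https://example.com/news,https://example.org/about
TEXT

# header_converters: :symbol turns each header into a Symbol key.
rows = CSV.parse(SAMPLE_CSV, headers: true, header_converters: :symbol).map(&:to_h)

rows.each do |row|
  puts "#{row[:refferal_link]} -> #{row[:home_link]}"
end
```

Each element of rows is a Hash keyed by :refferal_link and :home_link, which is the shape the job relies on when it reads the uploaded file.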

First, create the Rails application:

$ rails new scrape_data

$ cd scrape_data

Then generate the UploadCsv scaffold by running the command below:

$ rails g scaffold UploadCsv generated_csv:string csv_file:string

This creates the model, controller, views, and migration for UploadCsv.

Next, add a file upload field to the form. Replace the contents of app/views/upload_csvs/_form.html.erb with the code below:

<%= form_with(model: upload_csv, local: true) do |form| %>
  <% if upload_csv.errors.any? %>
    <div id="error_explanation">
      <h2><%= pluralize(upload_csv.errors.count, "error") %> prohibited this upload_csv from being saved:</h2>
      <ul>
        <% upload_csv.errors.full_messages.each do |message| %>
          <li><%= message %></li>
        <% end %>
      </ul>
    </div>
  <% end %>

  <div class="field">
    <%= form.label :csv_file %>
    <%= form.file_field :csv_file %>
  </div>

  <div class="actions">
    <%= form.submit %>
  </div>
<% end %>

Then add the CarrierWave gem, which handles the csv_file upload. Add the line below to the Gemfile:

gem 'carrierwave', '~> 2.0'

$ bundle install

Then generate the CarrierWave uploader:

$ rails generate uploader Avatar

Mount the uploader on the model in app/models/upload_csv.rb:

class UploadCsv < ApplicationRecord

  mount_uploader :csv_file, AvatarUploader

end

Before moving further, check that the application works. Run the command below:

$ rake db:create db:migrate

Update config/routes.rb:

Rails.application.routes.draw do
  resources :upload_csvs
  root 'upload_csvs#index'
end

Then start the server:

$ rails s

Then create a job that reads the CSV file and scrapes each link in it. The path of the generated file is saved in the generated_csv column of that record.

Generate the job as follows:

$ rails generate job genrate_csv

Add the gems below to the Gemfile and run bundle install:

gem 'httparty'

gem 'nokogiri'

Then replace the contents of app/jobs/genrate_csv_job.rb with the code below:

require 'csv'

class GenrateCsvJob < ApplicationJob
  queue_as :default

  def perform(upload_csv)
    processed_csv(upload_csv)

    file = Rails.root.join('public', 'product_data.csv').to_s
    headers = ['referal_link', 'home_link', 'count']

    # Write the header line followed by one row per scraped link
    CSV.open(file, 'w', write_headers: true, headers: headers) do |writer|
      @new_array.each do |new_array|
        writer << new_array
      end
    end

    # Store the path of the generated file on the record
    upload_csv.update(generated_csv: file)

    # Requires NotificationMailer; generate it by following the mailer steps
    NotificationMailer.send_csv(upload_csv).deliver_now! if @new_array.present?
  end

  # Fetches each refferal_link page, counts how many anchors point to
  # home_link, and stores one result row per CSV row in @new_array
  def processed_csv(upload_csv)
    @new_array = []

    CSV.foreach(upload_csv.csv_file.path, headers: true, header_converters: :symbol) do |row|
      row_map = row.to_h
      page = HTTParty.get(row_map[:refferal_link])
      page_parse = Nokogiri::HTML(page.body)
      link_array = page_parse.css('a').map { |link| link['href'] }
      link_array_group = link_array.group_by(&:itself).map { |k, v| [k, v.length] }.to_h

      # to_i turns a missing entry (nil) into a count of 0
      @new_array.push([row_map[:refferal_link], row_map[:home_link], link_array_group[row_map[:home_link]].to_i])
    end
  end
end
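
The counting step can be tried without any network access. The sketch below assumes the href values have already been extracted into a plain array (as link_array is in the job) and uses Array#tally, available from Ruby 2.7, in place of group_by. count_link is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper: how many times does target appear among hrefs?
# nil entries (anchors without an href attribute) are ignored.
def count_link(hrefs, target)
  hrefs.compact.tally.fetch(target, 0)
end

hrefs = [
  "https://example.org/",
  "/about",
  nil,
  "https://example.org/",
]

puts count_link(hrefs, "https://example.org/")    # number of matching anchors
puts count_link(hrefs, "https://example.org/faq") # 0 when the link is absent
```

The comparison is an exact string match, so a relative href such as /about is not counted as a match for the corresponding absolute URL.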

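Then enqueue the job from a callback once an upload_csvs record has been created, and add a presence validation for csv_file.

Update app/models/upload_csv.rb:

class UploadCsv < ApplicationRecord
  mount_uploader :csv_file, AvatarUploader

  # Reject the record when no CSV file is attached
  validates :csv_file, presence: true

  # Runs after the create transaction commits, so the record and the
  # stored file exist when the job starts
  after_create_commit :processed_csv

  def processed_csv
    GenrateCsvJob.perform_later(self)
  end
end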

After you upload a file, the scraped results are written to a generated CSV, which you can find at /scrape_data/public/product_data.csv.
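
To see what the generated file should look like, the following sketch writes a few result rows with CSV.open and reads them back. It writes to a temporary directory rather than public/, and the rows are invented sample values.

```ruby
require "csv"
require "tmpdir"

HEADERS = %w[referal_link home_link count].freeze

# Invented sample rows in the same shape the job produces.
rows = [
  ["https://example.com/blog", "https://example.org/", 2],
  ["https://example.com/news", "https://example.org/about", 0],
]

read_back = Dir.mktmpdir do |dir|
  path = File.join(dir, "product_data.csv")

  # write_headers: true emits the header line before the first row.
  CSV.open(path, "w", write_headers: true, headers: HEADERS) do |writer|
    rows.each { |row| writer << row }
  end

  # Read the file again; every value comes back as a String.
  CSV.read(path, headers: true).map(&:to_h)
end

read_back.each { |row| puts row.inspect }
```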

The generated file can be sent by email using the steps below.

First, generate the mailer:

$ rails generate mailer NotificationMailer

Update app/mailers/notification_mailer.rb:

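class NotificationMailer < ApplicationMailer
  def send_csv(upload_csv)
    @greeting = 'Hi'

    # generated_csv holds the path of the file written by the job
    attachments['parsed.csv'] = File.read(upload_csv.generated_csv)

    mail(to: 'sample@gmail.com', subject: 'CSV is parsed successfully.')
  end
end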
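
The mailer above sends to a single fixed address, while the application is described as taking a list of recipient emails. One possible way to turn a comma-separated string into a recipient list is sketched below. parse_recipients is a hypothetical helper, not part of the original code, and it relies on the URI::MailTo::EMAIL_REGEXP pattern from Ruby's standard library for a basic format check.

```ruby
require "uri"

# Hypothetical helper: split a comma-separated string into unique,
# syntactically plausible email addresses.
def parse_recipients(raw)
  raw.to_s
     .split(",")
     .map(&:strip)
     .reject(&:empty?)
     .select { |email| email.match?(URI::MailTo::EMAIL_REGEXP) }
     .uniq
end

recipients = parse_recipients("first@example.com, second@example.com, not-an-email,")
puts recipients.inspect
```

Action Mailer's mail(to: ...) accepts an array of addresses, so a list built this way could be passed to it in place of the fixed address.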

Also configure mail delivery in config/environments/development.rb or production.rb by adding the lines below:

config.action_mailer.default_url_options = { host: 'https://sample-scrape.herokuapp.com/' }
config.action_mailer.delivery_method = :smtp
config.action_mailer.smtp_settings = {
  user_name: 'sample@gmail.com',
  password: '*******123456',
  domain: 'gmail.com',
  address: 'smtp.gmail.com',
  port: 587,
  authentication: :plain
}
config.action_mailer.raise_delivery_errors = false
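
Hard-coding the SMTP password in an environment file makes it easy to leak through version control. As a sketch of one alternative, the hypothetical helper below builds the same settings hash from environment variables. The variable names SMTP_USERNAME and SMTP_PASSWORD are assumptions chosen for illustration.

```ruby
# Hypothetical helper: build Action Mailer SMTP settings from a Hash-like
# environment. fetch raises KeyError if a required variable is missing.
def smtp_settings_from(env)
  {
    user_name: env.fetch("SMTP_USERNAME"),
    password: env.fetch("SMTP_PASSWORD"),
    domain: "gmail.com",
    address: "smtp.gmail.com",
    port: 587,
    authentication: :plain,
  }
end

# Example with a plain Hash standing in for ENV.
settings = smtp_settings_from(
  "SMTP_USERNAME" => "sample@gmail.com",
  "SMTP_PASSWORD" => "not-a-real-password"
)
puts settings[:address]
```

In the environment file this could be used as config.action_mailer.smtp_settings = smtp_settings_from(ENV), provided the helper is defined or required before the configuration runs.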

Also update the view app/views/notification_mailer/send_csv.html.erb:

<h1>CSV has been processed, Thanks!</h1>

<p>

  <%= @greeting %>, please find the processed CSV attached to this email.

</p>

Thank you!