Comparing 2 data sets, possibly with a concurrent/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (which compares data from 2 sources and is implemented in Perl 5), and I would like to use Perl 6 instead.

My target data volume is in the range of 20-30 GB of uncompressed flat files. In terms of lines, a file can contain anywhere from 18 million to 28 million lines, with around 40-50 columns per line.

I do this type of data reconciliation on a daily basis. Reading a single file and populating the hash takes about 10 minutes, so roughly 20 minutes are spent reading both files and populating the hashes.

The comparison process takes about 30-50 minutes, including iterating over the hashes, collecting the desired results, and writing them to an output file (CSV or PSV).

All in all, it can take anywhere between 30 and 60 minutes on a dual-Xeon, 32-core server with 256 GB of RAM, depending on intermittent server load, to perform the whole process.

Now I am trying to bring down the total processing time even further.

Here is my current single-threaded approach using Perl 5:

  1. Fetch data from the 2 sources (let's say s1 and s2) one by one and populate my hash based on key-value pairs. The source of the data can be either a flat CSV or PSV file, or an array-of-arrays result from a database query via a DBI client. The data is always unsorted to start with.
  2. To be specific, I read the file line by line, split the fields, choose the desired indexes for the key/value pair, and insert them into the hash.
  3. After collecting the data and populating the hash with the desired key/value pairs, I start to compare and collect results (mainly comparing what is missing or different in s2 w.r.t. s1, and vice versa).
  4. Dump the output to an Excel file (very costly if the number of lines is large, e.g. ~1 million or greater) or to a simple CSV (a cheap operation, and the preferred method).

I was wondering whether I could somehow do the first step in parallel, i.e. collect data from both sources at once and populate my global hash, and then proceed to compare and dump the output.
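To make that concrete, here is the kind of thing I am picturing in Perl 6. I have not tried it; the file names, the separator, and the key/value column indexes are all made up:

use v6;

# Hypothetical sketch: load both sources at once, one thread each.
sub load(Str $path) {
    my %h;
    for $path.IO.lines -> $line {
        my @f = $line.split('|');   # assuming pipe-separated fields
        %h{@f[0]} = @f[1];          # placeholder key/value columns
    }
    %h;
}

# Each start block runs on its own thread and hands back a full hash.
my ($s1, $s2) = await start { load('s1.psv') },
                      start { load('s2.psv') };

say "loaded {$s1.elems} and {$s2.elems} keys";

If that is roughly the right direction, my remaining doubt is whether two start blocks can safely build and hand back hashes of ~20 million entries each.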

What options does Perl 6 provide for dealing with this situation? I have read about concurrent, asynchronous, and parallel operations in Perl 6, but I am not certain which one can help me here.

I would really appreciate any general guidance on the matter. I hope I explained my problem well, but sadly I don't have much to show for what I have tried so far; the reason is that I am just beginning to tackle this one. I am simply unable to see past the single-threaded approach and need some help.

Thanks.

EDIT

As my existing problem statement has been deemed 'too broad' by the community, allow me to attempt to highlight my pain points below:

  1. I would like to do the file comparison utilizing all 32 cores, if possible. I am just not able to come up with a strategy or an initial idea.
  2. What new techniques are available or applicable with Perl 6 in order to tackle this problem, or this type of problem?
  3. If I spawn 2 processes to read the file(s) and collect the data, is it possible to get the results back as an array or hash?
  4. Is it possible to compare the data (stored in hashes) in parallel? (A rough sketch of what I have in mind follows this list.)
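Here is that sketch for point 4. I have not tried it on the real files, so this is just the shape of the thing, written against toy data; the batch size is a pure guess:

use v6;

# Toy stand-ins for the real ~20M-entry hashes.
my %s1 = a => 1, b => 2, c => 3;
my %s2 = a => 1, b => 5, d => 4;

# .race spreads the map across cores; keep keys that are missing
# in s2 or whose values differ.
my @problems = %s1.keys.race(batch => 2).map(-> $k {
    %s2{$k}:exists
        ?? ( %s1{$k} ne %s2{$k} ?? "$k: {%s1{$k}}|{%s2{$k}}" !! Empty )
        !! "$k: {%s1{$k}}|NULL"
});
say @problems.sort;

Whether the per-key work is heavy enough to outweigh the threading overhead is exactly the part I cannot judge.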

My current P5 comparison logic is shown below for your reference. I hope this helps and keeps this question from being shut down.

package COMP;

use strict;
use warnings;

# Compare two keyed data sets, $data->{$src} and $data->{$tgt}.
# The result hashref has three buckets:
#   ms   - keys missing in the source (present only in the target)
#   mt   - keys missing in the target (present only in the source)
#   diff - keys present in both, but with different values
sub comp
{
  my ($data,$src,$tgt) = @_;
  my $result = {};

  my $ms    = ($result->{ms} = {});
  my $mt    = ($result->{mt} = {});
  my $diff  = ($result->{diff} = {});

  foreach my $key (keys %{$data->{$src}})
  {
    my $src_val = $data->{$src}{$key};

    if (!exists $data->{$tgt}{$key}) {
      push (@{$mt->{$key}}, "$src_val|NULL");
      next;
    }

    my $tgt_val = $data->{$tgt}{$key};
    if ($src_val ne $tgt_val) {
      push (@{$diff->{$key}}, "$src_val|$tgt_val");
    }
  }

  foreach my $key (keys %{$data->{$tgt}})
  {
    # Only keys absent from the source are left to report here;
    # everything else was handled in the first loop.
    if (!exists $data->{$src}{$key}) {
      push (@{$ms->{$key}}, "NULL|$data->{$tgt}{$key}");
    }
  }

  return $result;
}

1;
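Out of curiosity, I also sketched what a fairly literal Perl 6 translation of comp might look like. This is untested beyond toy data, so treat the idioms with suspicion:

use v6;

# Untested sketch: same ms / mt / diff buckets as the Perl 5 comp above.
sub comp (%data, $src, $tgt) {
    my %result = ms => {}, mt => {}, diff => {};

    for %data{$src}.keys -> $key {
        my $src-val = %data{$src}{$key};
        if %data{$tgt}{$key}:!exists {
            %result<mt>{$key}.push($src-val ~ '|NULL');
        }
        elsif $src-val ne %data{$tgt}{$key} {
            %result<diff>{$key}.push($src-val ~ '|' ~ %data{$tgt}{$key});
        }
    }

    for %data{$tgt}.keys -> $key {
        %result<ms>{$key}.push('NULL|' ~ %data{$tgt}{$key})
            if %data{$src}{$key}:!exists;
    }

    %result;
}

If the first for loop could be turned into a race-d map, that would cover pain point 4 from the edit above.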

If someone would like to try out the Perl 5 version, here are the sample output and the test script used.

Script Output

[User@Host:]$ perl testCOMP.pl 
$VAR1 = {
          'mt' => {
                    'Source' => [
                                  'source|NULL'
                                ]
                  },
          'ms' => {
                    'Target' => [
                                  'NULL|target'
                                ]
                  },
          'diff' => {
                      'Sunday_isit' => [
                                         'Yes|No'
                                       ]
                    }
        };

Test Script

[User@Host:]$  cat testCOMP.pl 
#!/usr/bin/env perl

use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;

my $data2 = {
  f1 => {
    Amitabh => 'Bacchan',
    YellowSun => 'Yes', 
    Sunday_isit => 'Yes',
    Source => 'source',
  },
  f2 => {
    Amitabh => 'Bacchan',
    YellowSun => 'Yes', 
    Sunday_isit => 'No',
    Target => 'target',
  },
};

my $result = COMP::comp ($data2,'f1','f2');
print Dumper $result;
[User@Host:]$ 
Zellers answered 29/8, 2018 at 16:37 Comment(15)
Are s1 and s2 sorted or sortable? In other words, can you sort them such that you start with the first line/row in s1, and then can immediately know whether the first row in s2 comes either before that and is irrelevant to your comparison logic, or is a corresponding row though perhaps with some columns missing or not matching, or comes after? – Lichenology
s1 and s2 are not sorted. Your sorting idea seems doable, but should I use the bash/shell sort to sort them after ingestion, inside the script? I have never tried this approach before. – Zellers
If I remember correctly, sorting such a huge file using system sort is a costly affair, but let me check it again and come back with exact numbers. Out of curiosity, can the comparison algorithm remain the same, or will it require tweaking or change if we start with sorted files? – Zellers
I can't tell if the algo would need to change. You haven't provided enough detail. (Oh wow. I see 4 people have voted to close your question for being too broad. One more and it gets closed. Voters, please be patient! @Zellers please add more detail!) If you add to your question a link to a gist at gist.github.com or whatever that's a copy of your P5 code, or at least the comparison logic, that would help me even if your question gets closed. Failing that, give more detail about the logic of "comparing on what is missing or different in s2 w.r.t s1 and vice-versa". – Lichenology
One last thing. It's entirely possible that sorting (which of course requires comparison) and then comparing again is significantly slower than the ideal, even if much faster than your current performance. So, to be clear, I'm not suggesting you invest a lot of effort into establishing a sort benchmark. Do it, yes, and make sure to try a sort program that explicitly claims to make good use of multiple cores and lots of RAM, and ensure it definitely does indeed use most/all of your cores, and then stop. Maybe it'll be part of a final solution. Maybe it'll just benchmark what's possible. – Lichenology
This is still really broad. You have a volume of data there. It's not bad, but asking 4 questions at once is still a lot. Consider either breaking this up into smaller programming chunks or trying over on SoftwareEngineers.SE. – Gastight
Please note that in its current versions, Perl 6 is quite a bit slower than Perl 5 at processing large files, and you may need to write lower-level code to get acceptable performance (even when parallelized). See the accepted answer to that question for an outline. – Appetency
@raiph: thank you for showing concern for my reported issue. Your gesture to assist me with my reported problem/issue, with a sense of urgency, is much appreciated. Tags are totally OK, and I had no clue that the tag value had brought me all this heat; thanks for removing it. My issue is not so pressing, and I can take some time to formulate a good strategy to attack this problem. Any performance gain greater than ~10% is acceptable to start with. I am also open to redoing this in P5, if there is a way. I've got all the time to spend on this problem. I enjoy doing stuff like this ;) – Zellers
@Joshua: May I know why the tag value has brought me all this heat? I was not able to understand that statement. About my question being too broad - I thought about it, and the way I saw it in my head is that it is very much a Perl problem, as I was unable to decide which Perl 5/6 functions could help me with my performance enhancement goal, probably also utilizing said approaches. Never did I intend to stuff 4 questions into one, and if I came off as if I did - give me a chance to change that. I just want to learn more and move on! My requirement is just for new ideas, as I am totally tapped out. – Zellers
The tag wiki says do not use the value tag because it is being burninated (removed). This is a manual process, and we have many people combing the lists who aren't as tolerant after 50 terrible questions. – Braun
@Machavity: do you mean that this type of design/approach-related question is not a good fit for SO? – Zellers
@Zellers SO is aimed more towards simple coding Q&A ("What's wrong with this code?"). Software Engineers is aimed more towards the programming "big picture" (note that's listed in their "on topic" section). Asking 4 questions (and they're not entirely related) is a lot to answer as well. I don't think it's bad, it just needs to be broken up and asked in the proper spots. – Gastight
OK, I didn't get time to get to this and it's late, but I have a couple of questions. First, what timezone are you in (I'm UK, it's 1.47 am, time for sleep)? Second, have you gotten a benchmark and RAM/core usage profile for sorting one of the files using a multi-core sort? – Lichenology
@raiph: sorry for the delayed response. Yes, we have a timezone gap! I am GMT+5:30, i.e. the India time zone. A single-core sort is not so useful. I am working on the multi-core sort and I haven't got it worked out yet. I am happy to take this to chat as you suggested. – Zellers
Hi again. Did you get a multicore sort working? If so, what was the time like? Or maybe implement en.wikipedia.org/wiki/Merge_sort#Parallel_merge_sort? Assuming sorted input files, the algo will be to read the input lazily, run comparisons of matching rows on a queuing/worker basis, and write the result lazily. I'm part way there with some P6 code, but I want to have it doing the whole thing (for test files) before I share it with you. I'm sure there'll be a way to do this in P5 too. I'm not sure how to start an SO chat room unless it offers a link to one. Anywhere else we can talk? – Lichenology

If you have an existing and working toolchain, you don't have to rewrite it all to use Perl 6. Its parallelism mechanisms work fine with external processes too. Consider

allnum.pl6

use v6;

# Kick off one child perl process per input file; :out captures each
# child's stdout so it can be collected lazily below.
my @processes =
    [ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
        .map( -> $filename {
            [ $filename, run "perl", "num.pl", $filename, :out ];
        })
        .hyper;

say "Laziness Here!";
my $time = time;
for @processes
{
    # .out.slurp blocks until that child has finished writing.
    say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";

num.pl

use warnings;
use strict;

my $file = shift @ARGV;
my $start = time;
my $result = 0;

open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
    chomp $thing;
    $thing =~ s/ //g;                  # strip the space padding
    $result = ($result + $thing) / 2;  # keep a running (weighted) average
}
print $result, " : ", time - $start, "s";

On my system

C:\Users\holli\tmp>perl6 allnum.pl6
Laziness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s

Files were set up like so

C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num5.txt
Simar answered 5/9, 2019 at 11:55 Comment(0)
