Monitoring Process of Cases[] on a Very Large Body of Information

I'm working with a very large body of text (~290 MB of plain text in one file). After importing it into Mathematica 8, I break it down into lowercase words, etc., so I can begin textual analysis.

The problem is that these operations take a long time. Is there a way to monitor their progress from within Mathematica? For operations that step through a variable I've used ProgressIndicator etc., but this is different. Searching the documentation and StackOverflow has not turned up anything similar.

In the following, I would like to monitor the progress of the Cases[ ] command:

input=Import["/users/USER/alltext.txt"];
wordList=Cases[StringSplit[ToLowerCase[input],Except[WordCharacter]],Except[""]];
Electrophilic answered 18/10, 2011 at 2:51 Comment(2)
I wonder if your question is about monitoring the Cases[] progress, or about optimizing your code. They are two entirely different problems. – Libertarian
@belisarius Almost, but not entirely. I gather from the responses that my need/request to monitor Cases[] stems from some slower choices in my code. Also, perhaps there is no readily apparent way to monitor such progress. – Electrophilic

It is possible to view the progress of the StringSplit and Cases operations by injecting "counter" operations into the patterns being matched. The following code temporarily shows two progress bars: the first showing the number of characters processed by StringSplit and the second showing the number of words processed by Cases:

input = ExampleData[{"Text", "PrideAndPrejudice"}];

wordList =
  Module[{charCount = 0, wordCount = 0, allWords}
  , PrintTemporary[
      Row[
        { "Characters: "
        , ProgressIndicator[Dynamic[charCount], {0, StringLength@input}]
        }]]

  ; allWords = StringSplit[
        ToLowerCase[input]
      , (_ /; (++charCount; False)) | Except[WordCharacter]
      ]

  ; PrintTemporary[
      Row[
        { "Words:      "
        , ProgressIndicator[Dynamic[wordCount], {0, Length@allWords}]
        }]]

  ; Cases[allWords, (_ /; (++wordCount; False)) | Except[""]]

  ]

The key to the technique is that the patterns used in both cases match against the wildcard _. However, that wildcard is guarded by a condition that always fails -- but not until it has incremented a counter as a side effect. The "real" match condition is then processed as an alternative.
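
The same idea can be seen on a tiny list. This is a minimal sketch rather than part of the answer's code; the names n and kept are made up for illustration.

```mathematica
(* Toy illustration of the always-failing guard. *)
n = 0;
kept = Cases[
   {"a", "", "b", ""},
   (* The left alternative only bumps the counter and never matches;
      the right alternative does the real filtering. *)
   (_ /; (++n; False)) | Except[""]
   ];

kept   (* {"a", "b"} *)
n      (* number of times the guard was evaluated *)
```

Because the guard sits in the first alternative, it runs whether or not the element ends up matching, which is what lets the counter track every element rather than only the kept ones.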

Criss answered 18/10, 2011 at 13:59 Comment(6)
+1. Just be aware that on my machine this is 7 times slower than the same unmonitored code. – Libertarian
Very clever! +1. You can simply use Except[WordCharacter] /; (++charCount; True) and Except[""] /; (++wordCount; True) instead of (_ /; (++charCount; False)) | Except[WordCharacter] and (_ /; (++wordCount; False)) | Except[""], with the same result but more efficiently. Using DeleteCases instead of Cases may give even more speedup, as Joshua Martell points out. – Intranuclear
@Alexey That is what I tried at first, but it did not count all characters and words, only those that matched the pattern. – Criss
Addition: using /; NumberQ[++charCount] and /; NumberQ[++wordCount] gives a little more speedup and shorter code. – Intranuclear
@Criss Now I understand what you mean. Interesting. – Intranuclear
Fantastic, let me give this a try as well. This is a really interesting approach that has broad application. – Electrophilic

Something like StringCases[ToLowerCase[input], WordCharacter..] seems to be a little faster. And I would probably use DeleteCases[expr, ""] instead of Cases[expr, Except[""]].
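
A small sketch of both suggestions on a made-up sentence (the variable text is only for illustration, not from the question):

```mathematica
text = "The quick  brown fox";  (* note the double space *)

(* Extract runs of word characters directly; no empty strings are produced. *)
viaStringCases = StringCases[ToLowerCase[text], WordCharacter ..]

(* The split-then-filter route, with DeleteCases in place of Cases/Except. *)
viaDeleteCases =
 DeleteCases[StringSplit[ToLowerCase[text], Except[WordCharacter]], ""]
```

Both expressions should return the same four-word list; the first never creates the empty strings that the second has to remove.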

Greenes answered 18/10, 2011 at 4:3 Comment(0)

It depends a little on what your text looks like, but you could try splitting the text into chunks and iterating over those. You could then watch the iterator with Monitor to see the progress. For example, if your text consists of lines terminated by a newline, you could do something like this:

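Module[{list, t = 0},
 (* read the file as a list of lines *)
 list = ReadList["/users/USER/alltext.txt", "String"];
 Monitor[
  (* wordlist is not localized, so it remains available after the Module *)
  wordlist = 
   Flatten@Table[
     StringCases[ToLowerCase[list[[t]]], WordCharacter ..], 
     {t, Length[list]}], 
  (* progress bar labelled with the fraction of lines processed *)
  Labeled[ProgressIndicator[t/Length[list]], N[t/Length[list]], Right]];
 Print["Ready"]]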

On a file of about 3 MB this took only marginally more time than Joshua's suggestion.
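
One way to check such a comparison on your own data is AbsoluteTiming. The sketch below is only an illustration under stated assumptions, not the benchmark mentioned above: it uses the Pride and Prejudice example text instead of the original file, leaves out the progress display, and the names sample, lines, tWhole and tChunked are invented here.

```mathematica
sample = ExampleData[{"Text", "PrideAndPrejudice"}];

(* Whole string in one call. *)
{tWhole, wordsWhole} =
  AbsoluteTiming[StringCases[ToLowerCase[sample], WordCharacter ..]];

(* Line by line, as in the chunked approach (without Monitor). *)
lines = StringSplit[sample, "\n"];
{tChunked, wordsChunked} =
  AbsoluteTiming[
   Flatten[StringCases[ToLowerCase[#], WordCharacter ..] & /@ lines]];

{tWhole, tChunked}            (* timings in seconds *)
wordsWhole === wordsChunked   (* sanity check: both routes give the same words *)
```

Timings will vary with the machine and the text, so treat any single run as a rough guide.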

Wulf answered 18/10, 2011 at 13:0 Comment(1)
Another interesting approach with broad application. I will implement it as well and monitor its relative speed. – Electrophilic

I don't know how Cases works, but List processing can be time consuming, especially if it is building the List as it goes. Since there is an unknown number of terms present in the processed expression, it is likely that is what is occurring with Cases. So, I'd try something slightly different: replacing "" with Sequence[]. For instance, this List

{"5", "6", "7", Sequence[]}

becomes

{"5", "6", "7"}.

So, try

bigList /. "" -> Sequence[]

it should operate faster as it is not building up a large List from nothing.

Sciomancy answered 18/10, 2011 at 3:20 Comment(6)
This is an excellent suggestion. I will try implementing it. Code efficiency is the root problem here! – Electrophilic
@Sciomancy I wouldn't worry about the internal list-building happening in Cases. It surely is optimized for list-building and is free from the AppendTo syndrome (quadratic list-building complexity). It is in fact somewhat more efficient than the method with Sequence. – Acme
@Leonid, I've had trouble with built-in functions in the past, usually involving list generation. (Unfortunately, no specific example comes to mind.) And, I'll admit, I did not test this. I was merely offering a possible alternative. – Sciomancy
@ian.milligan The real efficiency gains will likely lie in avoiding Mathematica's patterns for text manipulations for "as long as possible", and using string patterns, regular expressions, etc. instead. Keep in mind that many string-processing functions like StringCases also work on lists of strings and are very fast. – Acme
@Sciomancy Sure, alternatives are always good. I just wanted to point out that Cases does not suffer from this particular deficiency. – Acme
@Leonid, not a problem. I rarely use Cases, and I shrank my last large data set via SparseArray (mostly 0s in an 80^3 array). – Sciomancy
