Monitoring Process of Cases[] on a Very Large Body of Information

I'm working with a very large body of text (~290MB of plain text in one file). After importing it into Mathematica 8, I break it down into lowercase words, etc., so I can begin textual analysis.

The problem is that these operations take a long time. Is there a way to monitor their progress from within Mathematica? For operations driven by an iteration variable, I've used ProgressIndicator and the like, but this case is different. My searches of the documentation and StackOverflow have not turned up anything similar.

In the following, I would like to monitor the progress of the Cases[ ] command:

input=Import["/users/USER/alltext.txt"];
wordList=Cases[StringSplit[ToLowerCase[input],Except[WordCharacter]],Except[""]];

Electrophilic answered 18/10, 2011 at 2:51 Comment(2)

I wonder if your question is about monitoring the Cases[] progress, or about optimizing your code. They are two entirely unlike problems – Libertarian 18/10, 2011 at 12:38

@belisarius Almost, but not entirely.. I gather from the responses that my need/request to monitor Cases[] stems from some slower choices in my code. Also, perhaps there is no readily apparent way to monitor such progress.. – Electrophilic 18/10, 2011 at 12:56

It is possible to view the progress of the StringSplit and Cases operations by injecting "counter" operations into the patterns being matched. The following code temporarily shows two progress bars: the first showing the number of characters processed by StringSplit and the second showing the number of words processed by Cases:

input = ExampleData[{"Text", "PrideAndPrejudice"}];

wordList =
  Module[{charCount = 0, wordCount = 0, allWords}
  , PrintTemporary[
      Row[
        { "Characters: "
        , ProgressIndicator[Dynamic[charCount], {0, StringLength@input}]
        }]]

  ; allWords = StringSplit[
        ToLowerCase[input]
      , (_ /; (++charCount; False)) | Except[WordCharacter]
      ]

  ; PrintTemporary[
      Row[
        { "Words:      "
        , ProgressIndicator[Dynamic[wordCount], {0, Length@allWords}]
        }]]

  ; Cases[allWords, (_ /; (++wordCount; False)) | Except[""]]

  ]

The key to the technique is that the patterns used in both cases match against the wildcard _. However, that wildcard is guarded by a condition that always fails -- but not until it has incremented a counter as a side effect. The "real" match condition is then processed as an alternative.
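To see the counting trick in isolation, here is a minimal sketch on a four-element list. The names sample, n, m, kept and kept2 are illustrative and not part of the code above. The first pattern uses the always-failing guard, so its counter should advance once for every element tested. The second attaches the counter to the real pattern, so it should advance only for elements that actually match, which is why that form undercounts for progress purposes.

```mathematica
sample = {"a", "", "b", ""};   (* stand-in data, not the real word list *)

(* Always-failing guard: the counter ticks for every element tested,
   then the "real" alternative Except[""] decides the match *)
n = 0;
kept = Cases[sample, (_ /; (++n; False)) | Except[""]];

(* Counter attached to the real pattern: ticks only when Except[""] matches *)
m = 0;
kept2 = Cases[sample, Except[""] /; (++m; True)];

(* Compare the two results and the two counters *)
{kept, n, kept2, m}
```

Both calls should return the same filtered list; only the counters differ.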

Criss answered 18/10, 2011 at 13:59 Comment(6)

+1 just be aware that on my machine this is 7 times slower than the same unmonitored code – Libertarian 18/10, 2011 at 14:12

Very clever! +1. You can simply use Except[WordCharacter]/;(++charCount;True) and Except[""] /; (++wordCount; True) instead of (_ /; (++charCount; False)) | Except[WordCharacter] and (_ /; (++wordCount; False)) | Except[""] with the same success but with more efficiency. Usage of DeleteCases instead of Cases may give even more speedup as Joshua Martell points out. – Intranuclear 18/10, 2011 at 14:24

@Alexey That is what I tried at first, but it did not count all characters and words -- only those that matched the pattern. – Criss 18/10, 2011 at 14:26

Addition: using /; NumberQ[++charCount] and /; NumberQ[++wordCount] gives a little more speedup and shorter code. – Intranuclear 18/10, 2011 at 14:37

@Criss Now I understand what you mean. Interesting. – Intranuclear 18/10, 2011 at 14:48

Fantastic - let me give this a try as well. This is a really interesting approach that has broad application. – Electrophilic 18/10, 2011 at 15:1

Something like StringCases[ToLowerCase[input], WordCharacter..] seems to be a little faster. And I would probably use DeleteCases[expr, ""] instead of Cases[expr, Except[""]].
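As a small sketch of both suggestions, using a short stand-in string rather than the real file (the variable names words and cleaned are illustrative):

```mathematica
input = "The quick brown fox; the lazy dog.";   (* stand-in text *)

(* StringCases returns only the runs of word characters,
   so no "" entries are produced and nothing needs filtering afterwards *)
words = StringCases[ToLowerCase[input], WordCharacter ..];

(* If a list does contain empty strings, DeleteCases removes them directly *)
cleaned = DeleteCases[{"the", "", "quick", ""}, ""];

{words, cleaned}
```

Because StringCases collects matches instead of splitting on separators, the separate Cases step from the question is not needed in this variant.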

Greenes answered 18/10, 2011 at 4:3 Comment(0)

It depends a little on what your text looks like, but you could try splitting the text into chunks and iterating over those. You could then track the iterator with Monitor to see the progress. For example, if your text consists of lines terminated by a newline, you could do something like this:

Module[{list, t = 0},
 list = ReadList["/users/USER/alltext.txt", "String"];
 Monitor[wordlist = 
   Flatten@Table[
     StringCases[ToLowerCase[list[[t]]], WordCharacter ..], 
      {t, Length[list]}], 
  Labeled[ProgressIndicator[t/Length[list]], N@t/Length[list], Right]];
 Print["Ready"]]

On a file of about 3 MB this took only marginally more time than Joshua's suggestion.

Wulf answered 18/10, 2011 at 13:0 Comment(1)

Another interesting approach with broad application. I will implement it as well and monitor its relative speed. – Electrophilic 18/10, 2011 at 15:2

I don't know how Cases works, but List processing can be time consuming, especially if it is building the List as it goes. Since there is an unknown number of terms present in the processed expression, it is likely that is what is occurring with Cases. So, I'd try something slightly different: replacing "" with Sequence[]. For instance, this List

{"5", "6", "7", Sequence[]}

becomes

{"5", "6", "7"}.

So, try

bigList /. "" -> Sequence[]

It should operate faster, as it is not building up a large List from nothing.
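Since the speed claim here is untested, a rough way to check it is to compare the three approaches on the same data. This is only a sketch: the small list and the synthetic larger list (big) are stand-ins, the names viaRule, viaDelete and viaCases are illustrative, and the timings will vary by machine and version.

```mathematica
bigList = {"5", "", "6", "", "7"};   (* stand-in for the real word list *)

viaRule   = bigList /. "" -> Sequence[];
viaDelete = DeleteCases[bigList, ""];
viaCases  = Cases[bigList, Except[""]];

(* All three should give the same list *)
viaRule === viaDelete === viaCases

(* Rough timing on a larger synthetic list; results are in seconds *)
big = Flatten[ConstantArray[bigList, 100000]];
First /@ {
  AbsoluteTiming[big /. "" -> Sequence[];],
  AbsoluteTiming[DeleteCases[big, ""];],
  AbsoluteTiming[Cases[big, Except[""]];]
}
```

The three timings are listed in the same order as the three methods, so they can be compared directly on your own data.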

Sciomancy answered 18/10, 2011 at 3:20 Comment(6)

This is an excellent suggestion - I will try implementing it. Code efficiency is the root problem here! – Electrophilic 18/10, 2011 at 3:30

@Sciomancy I wouldn't worry about the internal list-building happening in Cases. It surely is optimized for list-building and is free from the AppendTo syndrome (quadratic list-building complexity). It is in fact somewhat more efficient than the method with Sequence. – Acme 18/10, 2011 at 3:32

@Leonid, I've had trouble with built-in functions in the past usually involving list generation. (Unfortunately, no specific example comes to mind.) And, I'll admit, I did not test this. I was merely offering a possible alternative. – Sciomancy 18/10, 2011 at 3:36

@ian.milligan The real efficiency gains will likely lie in avoiding using Mathematica's patterns for text manipulations for "as long as possible", but using string patterns, regular expressions, etc. Keep in mind that many string-processing functions like StringCases also work on lists of strings and are very fast. – Acme 18/10, 2011 at 3:36

@Sciomancy Sure, alternatives are always good. I just wanted to point out that Cases does not suffer from this particular deficiency. – Acme 18/10, 2011 at 3:38

@Leonid, not a problem. I rarely use Cases, and my last large data set, I shrunk via SparseArray (mostly 0s in an 80^3 array). – Sciomancy 18/10, 2011 at 4:14
