When should I use critical sections?

Here's the deal. My app has a lot of threads that do the same thing: read specific data from huge files (>2 GB), parse the data and eventually write back to that file.

The problem is that sometimes one thread reads entry X from file A while a second thread writes to entry X of that same file A. Would a problem occur?

The I/O code uses a TFileStream for every file. I split the I/O code out into local (static class) routines because I'm afraid there will be a problem. Since it's split out like that, I assume there should be critical sections.

Every case below is local (static) code that is not instantiated.

Case 1:

procedure Foo(obj:TObject);
begin ... end;

Case 2:

procedure Bar(obj:TObject);
var i: integer;
begin
  for i:=0 to X do ...{something}
end;

Case 3:

function Foo(obj: TObject; j: Integer): TSomeObject;
var i:integer;
begin
  for i:=0 to X do
    for j:=0 to Y do
      Result:={something}
end;

Question 1: In which of these cases do I need critical sections so there are no problems if more than one thread calls the routine at the same time?

Question 2: Will there be a problem if Thread 1 reads entry X from file A while Thread 2 writes entry X to file A?

When should I use critical sections? I try to imagine it in my head, but it's hard - my head has only one thread :))

EDIT

Is something like this suitable?

{a class for every 2GB file}

TSpecificFile = class
  cs: TCriticalSection; // create in the constructor, free in the destructor
  ...
end;

TFileParser = class
  SpecificFile: TSpecificFile; // 'file' is a reserved word in Delphi
  procedure ParseThis; procedure ParseThat; ...
end;

function Read(AFile: TSpecificFile): TSomeObject; // parameter renamed: 'file' is a reserved word
begin
  AFile.cs.Enter;
  try
    ... // read
  finally
    AFile.cs.Leave;
  end;
end;

function Write(AFile: TSpecificFile): TSomeObject;
begin
  AFile.cs.Enter;
  try
    ... // write
  finally
    AFile.cs.Leave;
  end;
end;

Now will there be a problem if two threads call Read with:

case 1: same TSpecificFile

case 2: different TSpecificFile?

Do I need another critical section?

Mesomorphic answered 19/3, 2011 at 9:1 Comment(3)
My advice: Buy yourself a copy of Joe Duffy's Book Concurrent Programming on Windows and learn about this properly. You can't learn about concurrent programming in dribs and drabs.Milline
Critical sections are not the only thread programming tool in your toolbox. I'm glad you didn't decide to use only TThread.Synchronize either. As David says, you need to know a lot about this topic before you can design multi-threaded code properly. That includes learning when to split up your design too. Have you considered having a writer thread, and a processor thread, and having the results of the writing merely queued, and all written by a single writer thread?Secco
Having one writer process may not scale on systems with parallel I/O enabled e.g. a RAID...Augustina

In general, you need a locking mechanism (critical sections are a locking mechanism) whenever multiple threads may access a shared resource at the same time, and at least one of the threads will be writing to / modifying the shared resource.
This is true whether the resource is an object in memory or a file on disk.
The reason the locking is necessary is that if a read operation happens concurrently with a write operation, the read is likely to obtain inconsistent data, leading to unpredictable behaviour.
Stephen Cheung has mentioned the platform-specific considerations with regard to file handling, and I'll not repeat them here.

As a side note, I'd like to highlight another concurrency concern that may be applicable in your case.

  • Suppose one thread reads some data and starts processing.
  • Then another thread does the same.
  • Both threads determine that they must write a result to position X of File A.
  • At best the values to be written are the same, and one of the threads effectively did nothing but waste time.
  • At worst, the calculation of one of the threads is overwritten, and the result is lost.

You need to determine whether this would be a problem for your application. And I must point out that if it is, just locking the read and write operations will not solve it. Furthermore, trying to extend the duration of the locks leads to other problems.

Options

Critical Sections

Yes, you can use critical sections.

  • You will need to choose the best granularity of the critical sections: One per whole file, or perhaps use them to designate specific blocks within a file.
  • The decision would require a better understanding of what your application does, so I'm not going to answer for you.
  • Just be aware of the possibility of deadlocks:
    • Thread 1 acquires lock A
    • Thread 2 acquires lock B
    • Thread 1 desires lock B, but has to wait
    • Thread 2 desires lock A - causing a deadlock, because neither thread is able to release the lock it already holds. (One way to avoid this is sketched below.)
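The usual way to avoid that particular deadlock is to make every thread acquire the locks in the same fixed order. A minimal sketch, assuming two critical sections created at startup (the names LockA/LockB are invented for illustration):

uses
  SyncObjs;

var
  LockA, LockB: TCriticalSection; // assume both are created before any thread starts

procedure DoWorkOnBothResources;
begin
  // Every thread takes LockA before LockB, never the other way around,
  // so the circular wait described above cannot occur.
  LockA.Enter;
  try
    LockB.Enter;
    try
      // ... work on both shared resources ...
    finally
      LockB.Leave;
    end;
  finally
    LockA.Leave;
  end;
end;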

I'm also going to suggest 2 other tools for you to consider in your solution.

Single-Threaded

What a shocking thing to say! But seriously, if your reason to go multi-threaded was "to make the application faster", then you went multi-threaded for the wrong reason. Most people who do that actually end up making their applications more difficult to write, less reliable, and slower!

It is a far too common misconception that multiple threads speed up applications. If a task requires X clock-cycles to perform, it will take X clock-cycles! Multiple threads don't speed up tasks; they permit multiple tasks to be done in parallel. But this can be a bad thing! ...

You've described your application as being highly dependent on reading from disk, parsing what's read and writing to disk. Depending on how CPU-intensive the parsing step is, you may find that all your threads spend the majority of their time waiting for disk I/O operations. In that case, the multiple threads generally only serve to shunt the disk heads to the far 'corners' of your (ummm, round) disk platters. Disk I/O is still the bottleneck, and the threads make it behave as if the files are maximally fragmented.

Queueing Operations

Let's suppose your reasons for going multi-threaded are valid, and you do still have threads operating on shared resources. Instead of using locks to avoid concurrency issues, you could queue your shared-resource operations onto specific threads.

So instead of Thread 1:

  • Reading position X from File A
  • Parsing the data
  • Writing to position Y in file A

Create another thread; the FileA thread:

  • the FileA thread has a queue of instructions
  • When it gets to the instruction to read position X, it does so.
  • It sends the data to Thread 1
  • Thread 1 parses its data --- while FileA thread continues processing instructions
  • Thread 1 places an instruction to write its result to position Y at the back of FileA thread's queue --- while FileA thread continues to process other instructions.
  • Eventually the FileA thread will write the data as required by Thread 1 (see the sketch after this list).
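A very rough sketch of such a dedicated file thread, assuming a record-based instruction queue that is locked only while enqueuing/dequeuing. All names are invented for illustration, and construction/teardown of the stream, queue and lock is omitted:

uses
  Classes, SysUtils, SyncObjs, Generics.Collections;

type
  // One queued instruction for this file's worker thread.
  TFileInstruction = record
    IsWrite: Boolean;
    Offset: Int64;
    Data: TBytes;   // bytes to write; reads and result hand-off are omitted here
  end;

  TFileWorker = class(TThread)
  private
    FStream: TFileStream;             // the only handle that ever touches this file
    FQueue: TQueue<TFileInstruction>;
    FQueueLock: TCriticalSection;     // protects the queue, not the file
  public
    procedure Enqueue(const Instr: TFileInstruction);
  protected
    procedure Execute; override;
  end;

procedure TFileWorker.Enqueue(const Instr: TFileInstruction);
begin
  FQueueLock.Enter;
  try
    FQueue.Enqueue(Instr);
  finally
    FQueueLock.Leave;
  end;
end;

procedure TFileWorker.Execute;
var
  Instr: TFileInstruction;
  HaveWork: Boolean;
begin
  while not Terminated do
  begin
    FQueueLock.Enter;
    try
      HaveWork := FQueue.Count > 0;
      if HaveWork then
        Instr := FQueue.Dequeue;
    finally
      FQueueLock.Leave;
    end;

    if HaveWork and Instr.IsWrite and (Length(Instr.Data) > 0) then
    begin
      FStream.Position := Instr.Offset;
      FStream.WriteBuffer(Instr.Data[0], Length(Instr.Data));
    end
    else if not HaveWork then
      Sleep(1); // nothing queued; yield rather than spin
  end;
end;

Other threads only ever call Enqueue (and, for reads, wait for the worker to hand the data back), so the file itself is never touched concurrently.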
Carbuncle answered 19/3, 2011 at 11:28 Comment(1)
+1 for the reasoning behind single-threaded. Nonetheless, a current I/O system can push 100 Mb/s easily, and the threads might read data tens of megabytes at a time, limiting thrashing to something more bearable. Or one can implement the producer-consumer algorithm, where only one thread does I/O and multiple threads do the parsing and processing.Mccormack

Synchronization is only needed for shared data that can cause a problem (or an error) if more than one agent is doing something with it.

Obviously the file-writing operation should be wrapped in a critical section for that file if you don't want other writer processes to trample on the new data before the write is completed -- the file may no longer be consistent if half of the new data has been modified by another process that does not see the other half of the new data (which hasn't been written out by the original writer process yet). Therefore you'll have a collection of CSs, one for each file. Each CS should be released as soon as possible when you're done writing.
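If you don't already have an object per file to hang the lock on, one way to keep "a collection of CSs, one per file" is a dictionary keyed by file name, itself guarded by a single lock. A rough sketch, assuming Generics.Collections is available (all names invented for illustration; creation of the two globals at startup is omitted):

uses
  SysUtils, SyncObjs, Generics.Collections;

var
  FileLocks: TObjectDictionary<string, TCriticalSection>; // created with [doOwnsValues]
  FileLocksGuard: TCriticalSection;                        // guards the dictionary itself

function LockFor(const FileName: string): TCriticalSection;
begin
  FileLocksGuard.Enter;
  try
    if not FileLocks.TryGetValue(FileName, Result) then
    begin
      Result := TCriticalSection.Create;
      FileLocks.Add(FileName, Result);
    end;
  finally
    FileLocksGuard.Leave;
  end;
end;

// Usage: LockFor('huge1.dat').Enter; try {...} finally LockFor('huge1.dat').Leave; end;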

In certain cases, e.g. memory-mapped files or sparse files, the O/S may allow you to write to different portions of the file at the same time. Therefore, in such cases, your CS will have to be on a particular segment of the file. Thus you'll have a collection of CS's (one for each segment) for each file.

If you write to a file and read it at the same time, the reader may get inconsistent data. Some O/Ss allow a read to happen simultaneously with a write (perhaps the read comes from cached buffers), but what you read may not be correct. If you need consistent data on reads, then the reader should also be subject to the critical section.

In certain cases, if you are writing to one segment and reading from another segment, the O/S may allow it. However, whether this returns correct data usually cannot be guaranteed, because you can't always tell whether two segments of the file reside in the same disk sector, among other low-level O/S details.

So, in general, the advice is to wrap any file operation in a CS, per file.

Theoretically, multiple threads should be able to read from the same file simultaneously, but locking it in a CS only allows one reader. If you need concurrent reads, you'll have to separate your implementation into "read locks" and "write locks" (similar to a database system). This is highly non-trivial though, as you'll then have to deal with promoting between different levels of locks.
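In Delphi you don't have to build the reader/writer part from scratch: SysUtils ships TMultiReadExclusiveWriteSynchronizer (many concurrent readers, one exclusive writer). A minimal per-file sketch, assuming the synchronizer is created at startup; it doesn't solve lock promotion for you, but it covers the common "many readers, one writer" case:

uses
  SysUtils;

var
  FileLock: TMultiReadExclusiveWriteSynchronizer; // one per file, created before the threads start

procedure ReadEntry;
begin
  FileLock.BeginRead;          // many readers may hold this at the same time
  try
    // ... read from the file ...
  finally
    FileLock.EndRead;
  end;
end;

procedure WriteEntry;
begin
  FileLock.BeginWrite;         // exclusive: waits until all readers have finished
  try
    // ... write to the file ...
  finally
    FileLock.EndWrite;
  end;
end;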

After note: The kind of thing you're trying to do (reading and writing huge data sets that are GBs in size, simultaneously and in segments) is what is typically done in a database. You should be looking into breaking your data files into database records. Otherwise, you either suffer from non-optimized read/write performance due to locking, or you end up re-inventing the relational database.

Augustina answered 19/3, 2011 at 9:13 Comment(3)
I edited my question. Could you tell me what you think about below of 'EDIT'?Mesomorphic
"In most O/S's, you cannot write to the same file simultaneously." - do you have a source for that? I'm not saying it's wrong, it's just not what I observed experimentally. Maybe I ran into the one operating system that does allow simultaneous writes to files.Mccormack
Well, I am just basing it on the O/S's that I am familiar with. Nothing really scientific blush. I believe the standard file I/O calls in Windows and Unix allow you to append while reading, or append simultaneously. There are flags that allow overlapped writes though, but the result may not be well defined if two processes write to the same area at the same time. So technically I may be wrong in making that statement. I'll edit.Augustina

Conclusion first

You don't need TCriticalSection. You should implement a Queue-based algorithm that guarantees no two threads are working on the same piece of data, without blocking.

How I got to that conclusion

First of all Windows (Win 7?) will allow you to simultaneously write to a file as many times as you see fit. I have no idea what it does with the writes, and I'm clearly not saying it's a good idea, but I've just done the following test to prove Windows allows simultaneous multiple writes to the same file:

I made a thread that opens a file for writing (with "share deny none") and keeps writing random stuff to a random offset for 30 seconds. Here's a pastebin with the code.
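The pastebin code isn't reproduced here, but the kind of test described boils down to something like this sketch (not the author's original code; file name and sizes are invented, and the file is assumed to already exist):

uses
  Windows, Classes, SysUtils;

procedure HammerFile(const FileName: string);
var
  Stream: TFileStream;
  Buf: array[0..255] of Byte;
  i: Integer;
  StartTick: Cardinal;
begin
  // fmShareDenyNone lets other threads/processes open the same file for writing too
  Stream := TFileStream.Create(FileName, fmOpenReadWrite or fmShareDenyNone);
  try
    StartTick := GetTickCount;
    while GetTickCount - StartTick < 30000 do      // keep going for ~30 seconds
    begin
      for i := 0 to High(Buf) do
        Buf[i] := Random(256);                     // random stuff...
      Stream.Position := Random(100 * 1024 * 1024); // ...at a random offset
      Stream.WriteBuffer(Buf, SizeOf(Buf));
    end;
  finally
    Stream.Free;
  end;
end;

Spin up several threads that all call HammerFile on the same file name; none of them fail to open or write.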

Why a TCriticalSection would be bad

A critical section only allows one thread to access the protected resource at any given time. You have two options: hold the lock only for the duration of the read/write operation, or hold the lock for the entire time required to process the given resource. Both have serious problems.

Here's what might happen if a thread holds the lock only for the duration of the read/write operations:

  • Thread 1 acquires the lock, reads the data, releases the lock
  • Thread 2 acquires the lock, reads the same data, releases the lock
  • Thread 1 finishes processing, acquires the lock, writes the data, releases the lock
  • Thread 2 acquires the lock, writes the data, and here's the oops: Thread 2 has been working on old data, since Thread 1 made changes in the background!

Here's what might happen if a thread holds the lock for the entire round-trip read & write operation:

  • Thread 1 acquires the lock, starts reading data
  • Thread 2 tries to acquire the same lock, gets blocked...
  • Thread 1 finishes reading the data, processes the data, writes the data back to file, releases the lock
  • Thread 2 acquires the lock and starts processing the same data again!

The Queue solution

Since you're multi-threading, and you can have multiple threads simultaneously processing data from the same file, I assume data is somehow "context free": You can process the 3rd part of a file before processing the 1st. This must be true, because if it's not, you can't multi-thread (or are limited to 1 thread per file).

Before you start processing you can prepare a number of "Jobs", that look like this:

  • File 'file1.raw', offset 0, 1024 KB
  • File 'file1.raw', offset 1024, 1024 KB
  • ...
  • File 'fileN.raw', offset 99999999, 1024 KB

Put all those "jobs" in a queue. Have your threads dequeue one Job from the queue and process it. Since no two jobs overlap, threads don't need to synchronize with each other, so you don't need the critical section. You only need the critical section to protect access to the Queue itself. Windows makes sure threads can read and write to/from the files just fine, as long as they stick to the allocated "Job".

Mccormack answered 19/3, 2011 at 10:27 Comment(4)
For the producer-consumer approach - is it this? --> docwiki.embarcadero.com/RADStudio/en/…Mesomorphic
Also, I understand your idea. To use cache with last 500 entries that has been read. The read procedure to check the cache before reading. I have that done, sorry I didn't mention it.Mesomorphic
No. The producer-consumer is implemented using a Queue. Here's a docwiki link for the Queue: docwiki.embarcadero.com/CodeExamples/en/…Mccormack
I edited my answer, moved the sample code to Pastebin because it was taking too much space in here, and replaced the generic "producer-consumer" text with a more detailed explanation of how things (should) work. Just re-read the last section, that's the one with the Queue stuff.Mccormack
