Windows Network and PFM Files Corruption

Last update:
November 11, 1998

14 May 1998

A report from Renato Cevoli, later confirmed by Joan Pages, states that they are experiencing PFM file corruption when networking Pragma 4 with 4 to 22 workstations.  The workstations run Windows 95, while the servers run Windows NT 4.0 and Windows NT 3.51.  The problem also occurs on a Windows 95 peer-to-peer network.

The file corruption occurs at random, but it seems to be caused by collisions when multiple workstations access the same file at the same time.  No corruption happens when only one workstation at a time accesses the files.

The corruption shows up as a workstation that hangs when trying to access a corrupted file, or as a workstation that loops during a GET NEXT in the file.

A client who has been running Pragma 4 since 1996 on a Netware 4.1 network with 35 Windows 95 clients and the same application software has never had similar file corruption.

It is thought that the Windows 95 client for Microsoft networks could have a bug that the Windows 95 client for Netware does not have.


19 May 1998

A talk with Renato Cevoli confirms that the problem is random.  He suspects that it is a problem with the network card and Windows 95.  The problem happens especially when the network is not yet stable, that is, while it is being built or configured.  Afterwards the problems seem to subside.

Renato checked one workstation: it runs the original version of Windows 95.  The network protocol is NetBEUI.

Also, it seems that mostly the same files get corrupted.

On the same networks Renato has experienced similar problems when printing over the network with programs other than Pragma.


22 May 1998

From a subsequent talk with Renato Cevoli it seems that the problems of corruption go away when using Windows 95 SR2.


18 September 1998

Renato Cevoli confirmed that the corruption problems go away when all the original versions of Windows 95 are eliminated from a network.  The important point is not to have any workstation running the original Windows 95.  One rotten apple spoils the whole basket.

In networks without original Windows 95 workstations he does not experience problems running mock flag 2 or 0 PFM files.  He does not (rightfully) network mock flag 1 files.  They are used only locally. 

Thank you Renato for your help.


Which version of Windows 95 do I have?

To find out which version of Windows 95 you have, go to the DOS prompt of Windows 95 and type VER.  If the answer is "Windows 95. [Version 4.00.950]" we suggest that you upgrade the machine to Windows 95 SR2 or Windows 98 as soon as possible.
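If the check has to be repeated on many workstations, the comparison can be automated.  The following is only a minimal sketch: the function name is_original_windows95() is made up for this illustration, and it assumes that the text printed by VER has already been captured into a string.  It looks for nothing more than the version string quoted above.

```c
#include <string.h>

/* Hypothetical helper, not part of Pragma.
   Returns 1 if the captured output of VER contains the version
   string of the original Windows 95 release quoted in this note,
   and 0 otherwise. */
int is_original_windows95(const char *ver_output)
{
    return strstr(ver_output, "[Version 4.00.950]") != NULL;
}
```

A workstation for which this function returns 1 is a candidate for the upgrade suggested above.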


1st October 1998

The problem with the PFM file corruption on Windows networks seems to continue.  It manifests itself in the following ways. 

The PFM file manager code involved is old Pragma 4 code that up to now has never given problems on UNIX or NOVELL.

Looking at the code, both Manfred and I came to the same conclusion.  I'll try to illustrate it, taking the deletion of a record as an example.  The same applies to saving and updating a record.

The function in which the actual delete happens is called deleterecord() and looks (in a simplified way) like this:

lockcontrolunit (file)
get FCS (file)
recordunit=indexfind (file...)
chainread (file, recordunit...)
build key (file...)
indexdelete (file...)
unlockcontrolunit (file)

Writing to the file to delete a record implies updating the control unit, the index and the actual data.  This is done with low-level functions such as seek, write, read and locking, which are supplied with the MICROSOFT C++ compiler and are fully compatible with Windows 95 and NT in a multiuser environment.  The evidence that the code works is that it works 100% of the time in a single-user environment.  The only failures are the specific problems described above, and they occur in a multiuser environment.

So what can go wrong in a multiuser environment?  Forget for the moment the record locking that locks a record for a user.  During the milliseconds in which the file is being updated, the whole file must be locked.  Obviously, while the index is being rewritten or data is being added to the file, the operation must not be interrupted by another user.  This locking is accomplished in the functions lockcontrolunit() and unlockcontrolunit().  Lockcontrolunit() locks the part of the file that contains the control page, and while that is locked nobody else has access to the file.

The actual locking of the control page is done in the following way:

The file pointer of the file is first set to the beginning of the control page.  If this operation is not successful, a crash occurs.   Then the actual control page is locked.

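if (lseek(file->fh, unitotooffset(file, unit), SEEK_SET) == -1)
    fmcrash(file, diskreaderror);    /* could not position on the control page */
while (1)
    {
    status = locking(file->fh, LK_LOCK, size);    /* try to lock size bytes */
    if (status == 0)    /* 0 means the lock was obtained */
        break;
    }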

Locking and lseek are C library functions supplied by Microsoft with the C++ compiler.  Size is the number of bytes to lock.  The locking function returns 0 only if the locking was successful.  Thus the program proceeds only if the locking was successful; otherwise it stays in the loop.  (This is not as bad as it seems.  The locking lasts only milliseconds while the file is being updated.)
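For comparison, here is a sketch of the same retry idea with an upper limit on the number of attempts, so that a lock that never succeeds can be reported instead of looping forever.  This is not the Pragma code.  The names lock_attempt_fn, lock_with_limit() and fake_attempt() are invented for the illustration, and the real call to locking() is replaced by a callback so that the sketch does not depend on any particular compiler.

```c
/* Hypothetical callback type: one attempt to lock the control page.
   It returns 0 on success and -1 on failure, the same convention
   the loop above relies on. */
typedef int (*lock_attempt_fn)(void *context);

/* Try to obtain the lock at most max_attempts times.
   Returns the number of attempts used if the lock was obtained,
   or -1 if every attempt failed. */
int lock_with_limit(lock_attempt_fn attempt, void *context, int max_attempts)
{
    int tries;

    for (tries = 1; tries <= max_attempts; tries++) {
        if (attempt(context) == 0)
            return tries;
    }
    return -1;
}

/* Stand-in for the real lock call, for illustration only:
   it fails as many times as the counter passed in context says,
   then succeeds. */
int fake_attempt(void *context)
{
    int *failures_left = (int *)context;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

Note that a limit of this kind only helps when the lock call reports its failure.  It does nothing for the case considered in this note, where the lock call reports success although the lock is not effective.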

The code between the locking and unlocking of the control page takes it for granted that when the file is updated it is locked for everybody else.

Now if for some mysterious reason the control page locking does not work, you can get the problems described at the beginning.  You could have two workstations updating the same file, and from there on everything goes downhill.
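The effect of two workstations updating the same file at once can be shown with a toy model.  The sketch below is an assumption-laden simplification, not the PFM format: the control unit is reduced to a single record count, and the structures and function names are invented for the illustration.  Each workstation reads the control unit, then writes it back after deleting one record.  When the lock works, the two updates happen one after the other.  When it does not, both workstations read the same old value and one delete is lost.

```c
/* Hypothetical, heavily simplified control unit: only a record count. */
struct control_unit {
    int record_count;
};

/* What one workstation holds in memory while it updates the file. */
struct workstation {
    int cached_count;
};

/* Step 1 of an update: read the control unit from the file. */
static void read_control(struct workstation *ws, const struct control_unit *cu)
{
    ws->cached_count = cu->record_count;
}

/* Step 2 of an update: write the control unit back after one delete. */
static void write_after_delete(const struct workstation *ws, struct control_unit *cu)
{
    cu->record_count = ws->cached_count - 1;
}

/* Two workstations each delete one record.
   lock_works != 0: the updates are serialized, as the code assumes.
   lock_works == 0: the steps interleave, as if the lock had no effect.
   Returns the record count left in the control unit. */
int delete_twice(int initial_count, int lock_works)
{
    struct control_unit cu;
    struct workstation a, b;

    cu.record_count = initial_count;
    if (lock_works) {
        read_control(&a, &cu);
        write_after_delete(&a, &cu);
        read_control(&b, &cu);
        write_after_delete(&b, &cu);
    } else {
        read_control(&a, &cu);
        read_control(&b, &cu);          /* b sees the same old count as a */
        write_after_delete(&a, &cu);
        write_after_delete(&b, &cu);    /* overwrites a's update */
    }
    return cu.record_count;
}
```

Starting from 10 records, the serialized case leaves a count of 8, while the interleaved case leaves 9 although two records were deleted.  In a real file the index and the data chains would be out of step in the same way.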

As you can imagine, it is very difficult to diagnose such a sporadic error.  Is the scenario described above really what happens?  Or does the problem have other manifestations that we have not heard of?

Please keep in mind that the locking and unlocking of a control unit is a very sensitive part of the code, which affects the overall speed of working with files.

I admit that this whole hypothesis sounds far-fetched.  Why should a simple locking of a few bytes fail and yet give back an OK status?  Or are we barking up the wrong tree?  Does anybody have a better idea?


November 5, 1998

After more than 3 weeks of field testing it seems that we have solved the problem in Pragma 5 by introducing Strong Network Locking.

 

00-05-08