How to detect NaN's

Sat Jun 1 04:08:07 AEST 1991

In article <1991May30.204332.16506 at litwin.com>, vlr at litwin.com (Vic Rice)
writes:
> I have encountered a problem on a DECStation 5000 using the Mips Fortran
> compiler. Some operation I am performing is generating a NaN which then
> proceeds to propogate to other dependent variables. Since this is in
> a loop, I am not exactly sure where it starts.
> 
> How can I test a variable for the presence of a NaN or Infinity ???
> -- 
> Dr. Victor L. Rice
> Litwin Process Automation

One of my biggest gripes with both the Sun and Mips Fortran compilers
is that there is no easy way to turn on floating point exception
handling. This results in exactly the problem Vic describes: at the
end of the program, most of your output is infinity or NaN, and you
have no clue where it started.

Here is a simple floating point exception handler that I wrote (Mips
Fortran version). Compile it with C and link with your Fortran (or C)
program, like this:

	% cc -c fpx_mips.c
	% f77 prog.f fpx_mips.o -o prog

or something like that.

To use it, just insert a statement

	call standard_fpx()

or 

	call setcsr(whatever)

as the first executable statement in your Fortran program.

I welcome suggestions for improvement, but I am not prepared to answer
a lot of questions or fix bugs in this code. Good Luck.

A last remark for the knowledgeable: the reason I am trapping SIGILL
is a known bug in Ultrix 4.1. Floating point errors cause SIGILL
instead of SIGFPE.
--
Fritz Keinert                             phone:  (515) 294-5128
Department of Mathematics                 fax:    (515) 294-5454
Iowa State University                     e-mail: keinert at iastate.edu
Ames, IA 50011

--------------------------------------------------------------------------
/**********************************************************************
 *                                                                    *
 *  Floating point exception handler for MIPS machines                *
 *  Fritz Keinert (keinert at iastate.edu)                               *
 *  April 5, 1991                                                     *
 *                                                                    *
 *  building on routines by Howard Jespersen and Srini Uppugunduri    *
 *                                                                    *
 *  As they say in NETLIB: Everything free comes with no guarantee.   *
 *                                                                    *
 * This file contains the following routines:                         *
 *                                                                    *
 * int getcsr()     - returns the contents of the control/status      *
 *                    register (CSR) of the MIPS floating point       *
 *                  unit (FPU). The register contents are             *
 *                  described in /usr/include/mips/fpu.h              *
 *                                                                    *
 * int getcsr_()    - same routine, callable from Fortran             *
 *                                                                    *
 * int setcsr(csr)  - sets CSR to new value, returns old value;       *
 *                    also installs fpx_handler                       *
 *                                                                    *
 * int setcsr_(csr) - same routine, callable from Fortran             *
 *                                                                    *
 * void fpx_handler(...) - floating point exception handler. Prints   *
 *                         an error message and dies.                 *
 *                                                                    *
 * int standard_fpx() - same as setcsr(3584), to set up the           *
 *                      most frequently useful behavior.              *
 *                                                                    *
 *                                                                    *
 * int standard_fpx_() - same routine, callable from Fortran          *
 *                                                                    *
 *                                                                    *
 **********************************************************************
 *                                                                    *
 * The floating point exception handling works like this:             *
 *                                                                    *
 * First, you have to tell the floating point unit to enable          *
 * interrupts, by setting the CSR accordingly. (I assume this         *
 * stands for Control and Status Register). The correct value         *
 * for CSR is found by adding                                         *
 *                                                                    *
 *      one of the following, to define the rounding behavior:        *
 *                                                                    *
 *          0  - round to nearest                                     *
 *          1  -          zero                                        *
 *          2  -          + infinity                                  *
 *          3  -          - infinity                                  *
 *                                                                    *
 *      plus as many as needed of the following, to turn on           *
 *      interrupts:                                                   *
 *                                                                    *
 *        128  - cause interrupt on inexact result                    *
 *        256  -                    underflow                         *
 *        512  -                    overflow                          *
 *       1024  -                    division by zero                  *
 *       2048  -                    invalid operation                 *
 *                                                                    *
 * The standard values are:                                           *
 *                                                                    *
 * 1. round to nearest, interrupt on overflow, division by zero       *
 * and invalid operation. CSR = 3584.                                 *
 *                                                                    *
 * 2. round to nearest, interrupt on underflow, overflow,             *
 * division by zero and invalid operation. CSR = 3840.                *
 *                                                                    *
 * Second, you have to install an interrupt handler to give you       *
 * a message when an interrupt actually happens. Otherwise, the       *
 * operating system will just ignore it, or kill your program with    *
 * a highly informative message like "illegal operation".             *
 *                                                                    *
 * My interrupt handler just prints a message and dies. For more      *
 * sophisticated behavior, like "print a message on underflow, set    *
 * the result to zero, and continue", or a traceback of where this    *
 * happened, you are on your own.                                     *
 *                                                                    *
 * Use the dbx debugger to find out where in your source code the     *
 * problem is. Here is an example: program test dies with the         *
 * message                                                            *
 *                                                                    *
 *      caught signal 4, code 0                                       *
 *      floating point error: overflow                                *
 *      at or near location 0x400448                                  *
 *                                                                    *
 * Call up dbx, execute the first line of your program, set the       *
 * program counter to the address reported by the interrupt routine,  *
 * and ask dbx where that is:                                         *
 *                                                                    *
 *      % dbx test                                                    *
 *      dbx version 2.0                                               *
 *      Type 'help' for help.                                         *
 *      reading symbolic information ...                              *
 *      [using test.test]                                             *
 *      test:   6  print*,'round to nearest  = 0'                     *
 *      (dbx) stop at 6                                               *
 *      [2] stop at "test.f":6                                        *
 *      (dbx) run                                                     *
 *      [2] stopped at [test.test:6 ,0x40020c]                        *
 *                             print*,'round to nearest  = 0'         *
 *      (dbx) assign $pc=0x400448                                     *
 *      4195400                                                       *
 *      (dbx) where                                                   *
 *      >  0 test.test() ["test.f":24, 0x400448]                      *
 *         1 main.main(0x7fffb9c4, 0x7fffb9cc, 0x0, 0x0, 0x0)         *
 *                                  ["main.c":35, 0x400f2c]           *
 *      (dbx) quit                                                    *
 *                                                                    *
 * and we know: it was around line 24 in file test.f.                 *
 *                                                                    *
 * This method may not work if the error is in a system subroutine.   *
 * For example, if you try sqrt(-1.), this will tell you that the     *
 * problem is inside a square root, but there may be many square      *
 * roots in your program.                                             *
 *                                                                    *
 * If all else fails, just run your program inside dbx until it       *
 * dies, and then type "where". This will get you an exact            *
 * traceback of where you are (sqrt function, called from line        *
 * ?? of subroutine ??, which was called from line ?? in main         *
 * program).                                                          *
 *                                                                    *
 * You may have to use the "-g" flag for compiling, to get the        *
 * debugger to work properly.                                         *
 *                                                                    *
 **********************************************************************/

#include <mips/fpu.h>
#include <mips/cpu.h>
#include <signal.h>
#include <stdio.h>

void fpx_handler(int sig, int code, struct sigcontext *scp)
{
        union fpc_csr csr;

        csr.fc_word = scp->sc_fpc_csr;

        fprintf(stderr,"caught signal %d, code %d\n", sig, code);

        if (csr.fc_struct.ex_invalid)
        {
          fprintf(stderr,"floating point error: invalid operation\n");
          fprintf(stderr,"at or near location 0x%x\n",scp->sc_pc);
          exit(1);
        }
        if (csr.fc_struct.ex_divide0)
        {
          fprintf(stderr,"floating point error: division by zero\n");
          fprintf(stderr,"at or near location 0x%x\n",scp->sc_pc);
          exit(1);
        } 
        if (csr.fc_struct.ex_underflow)
        {
          fprintf(stderr,"floating point error: underflow\n");
          fprintf(stderr,"at or near location 0x%x\n",scp->sc_pc);
          exit(1);
        }
        if (csr.fc_struct.ex_overflow)
        {
          fprintf(stderr,"floating point error: overflow\n");
          fprintf(stderr,"at or near location 0x%x\n",scp->sc_pc);
          exit(1);
        }
        if (csr.fc_struct.ex_inexact)
        {
          fprintf(stderr,"floating point error: inexact operation\n");
          fprintf(stderr,"at or near location 0x%x\n",scp->sc_pc);
          exit(1);
        }
        if (csr.fc_struct.ex_unimplemented)
        {
          fprintf(stderr,"floating point error: unimplemented operation\n");
          fprintf(stderr,"at or near location 0x%x\n",scp->sc_pc);
          exit(1);
        }

        if (code == BRK_DIVZERO) {
          fprintf(stderr,"integer error: division by zero\n");
          fprintf(stderr,"at or near location 0x%x\n",scp->sc_pc);
          exit(1);
        }
          exit(1);

        fprintf(stderr,"floating point error: unknown exception code\n");
        fprintf(stderr,"at or near location 0x%x\n",scp->sc_pc);
        fprintf(stderr,"fp status word 0x%x\n",csr);
        exit(1);
}

int getcsr()
/*
 * return contents of fpu control and status register
 */
{
  return get_fpc_csr();
}

int getcsr_()
/*
 * Fortran wrapper for getcsr()
 */
{
  return getcsr();
}

int setcsr(unsigned int csr)
/*
 * set up floating point interrupt handler, turn on interrupts
 */
{
  signal(SIGTRAP, fpx_handler);
  signal(SIGFPE,  fpx_handler);
  signal(SIGILL,  fpx_handler);
  return set_fpc_csr(csr);
}

int setcsr_(unsigned int *csr)
/*
 * Fortran wrapper for setcsr()
 */
{
  return setcsr(*csr);
}

int standard_fpx()
/*
 * set up standard error handling: die on overflow,
 * division by zero, illegal operation, but continue
 * on underflow. Round to nearest.
 */
{
  return setcsr((unsigned int) 3584);
}

int standard_fpx_()
/*
 * Fortran wrapper for standard_fpx()
 */
{
  return standard_fpx();
}