How to fix symbol lookup error: undefined symbol errors in a cluster environment
Asked Answered
G

1

47

I'm working on some python code that extracts some image data from an ECW file using GDAL (http://www.gdal.org/) and its python bindings. GDAL was built from source to have ECW support.

The program is run on a cluster server that I ssh into. I have tested the program through the ssh terminal and it runs fine. However, I would now like to submit a job to the cluster using qsub, but it reports the following:

Traceback (most recent call last):
  File "./gdal-test.py", line 5, in <module>
    from osgeo import gdal
  File "/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/__init__.py", line 21, in <module>
    _gdal = swig_import_helper()
  File "/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/__init__.py", line 17, in swig_import_helper
    _mod = imp.load_module('_gdal', fp, pathname, description)
ImportError: /mnt/aeropix/prgs/.local/lib/libgdal.so.1: undefined symbol: H5Eset_auto2

I did a bit more digging and tried using LD_DEBUG=symbols to try and work out where the difference was, but that's about as far as my knowledge/understanding has got me.

For reference, here's what happens with LD_DEBUG=symbols and running the code in the ssh terminal (piping through grep H5Eset_auto2 to reduce some of the output):

Symbol debug output for code running in ssh terminal:

 11359: symbol=H5Eset_auto2;  lookup in file=/usr/bin/python26 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/_gdal.so [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libgdal.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libgcc_s.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libhdf5.so.7 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/bin/python26 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/_gdal.so [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libgdal.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libgcc_s.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libhdf5.so.7 [0]

Symbol debug output for code submitted using qsub:

 16915: symbol=H5Eset_auto2;  lookup in file=/usr/bin/python26 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/_gdal.so [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libgdal.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libgcc_s.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libhdf5.so.7 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libjpeg.so.62 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpng12.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpq.so.4 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libcurl.so.3 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libgssapi_krb5.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libkrb5.so.3 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libk5crypto.so.3 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libcom_err.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libidn.so.11 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libssl.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libcrypto.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libNCSEcw.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libNCSEcwC.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libNCSCnet.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libNCSUtil.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/librt.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libxml2.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libz.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libcrypt.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libresolv.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libnsl.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libkrb5support.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libkeyutils.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libselinux.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libsepol.so.1 [0]
 16915: /mnt/aeropix/prgs/.local/lib/libgdal.so.1: error: symbol lookup error: undefined symbol: H5Eset_auto2 (fatal)
ImportError: /mnt/aeropix/prgs/.local/lib/libgdal.so.1: undefined symbol: H5Eset_auto2

I guess I'm not sure why it seems to stop looking in libgdal.so.1 when submitted using qsub, when it continues to look when just run in the terminal. I also note that the qsub job is able to correctly locate libhdf5.so.7 (which is where it should find H5Eset_auto2 ) as it can find a different symbol, H5Eprint :

 16915: symbol=H5Eprint;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 16915: symbol=H5Eprint;  lookup in file=/mnt/aeropix/prgs/.local/lib/libgdal.so.1 [0]
 16915: symbol=H5Eprint;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libm.so.6 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libgcc_s.so.1 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libpthread.so.0 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libc.so.6 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libdl.so.2 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libutil.so.1 [0]
 16915: symbol=H5Eprint;  lookup in file=/mnt/aeropix/prgs/.local/lib/libhdf5.so.7 [0]

Any pointers on this would be incredibly useful at this stage (I hope that's enough information - I'm more than happy to provide more information, I'm just not sure what else might be useful at this stage).

EDIT:

It seems that the contents of /usr/bin are different for jobs submitted using qsub (specifically libtool is missing). This is being investigated.

Godoy answered 21/11, 2014 at 5:32 Comment(27)
One thing that might help is to look at the environment variable LD_LIBRARY_PATH when you're running interactively and when it's running in the queue. Look for any discrepancies related to HDF5.Spiller
I don't think so, I think there might be something more going on than that. I know the qsub job can find libhdf5.so.7 as it correctly locates symbol H5Eprint before getting stuck on H5Eset_auto2 As a minimum working example, the single line from osgeo import gdal works when I run it through the ssh terminal, but fails when submitted using qsub. Also, I have set export LD_LIBRARY_PATH=/mnt/aeropix/prgs/.local/lib (which is where libhdf5.so.7 is)Godoy
I don't have an answer yet but if you look at the very first output (interactive), notice that the lines consists of two identical groups, one repeated after the other. However, in the second output (in the queue system) the first group of lines is the same but the second group of lines looks very different.Spiller
Another thing that might help you debug: capture all the environment variables in your job script (env >$HOME/myenvs), then reproduce that same environment (env $(xargs <$HOME/myenvs) bash) and see if it fails in the exact same way. If it does, examine and look for any suspicious discrepancies from your interactive environment.Spiller
That's okay, thanks for looking into it anyway! I hadn't actually noticed they were two identical groups (or not in the second case) until now. I did notice that the second one, at least in the first half, looks at libhdf5.so.7 but then continues onward?Godoy
It suggests that there are two separate "processes": first one works just fine and appears to be using the identical environment, whereas the second one does not and appears to be "doing its own thing" (and thus failing to find HDF5).Spiller
Okay... is that normal? also, I tried replicating the cluster environment variables but it didn't throw an error in the ssh terminalGodoy
That's really odd. So I looked at the code of the GDAL bindings and it appears to be trying to load the libraries twice. Can you try running this interactively? Also try it in the queue to see if you get the same results. f, p, d = imp.find_module('_gdal', ['/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/']); print(f, p, d); imp.load_module('_gdal', f, p, d) (Replace the semicolons with newlines if you wish.) The code essentially tries to replicate what is done by the GDAL bindings, and hopefully will help narrow down the problem.Spiller
(Note: prepend import imp;)Spiller
Sorry for the long response time, I was out of the office for the weekend. When I run that that interactively I get: drive.google.com/file/d/0By6cKos4YGCGWEtQNWM1ck9veVk/… However, when run in the queue I get: drive.google.com/file/d/0By6cKos4YGCGWHlndERYZ1hlaDg/…Godoy
That's really odd. Sorry, I don't have any good guesses right now, but you can try checking other things and see if they make any difference: Are jobs run on a different user? Are the Python executables different? (versions?) Are the shells different? (e.g. Bash) Is there any clue from the detailed output of the dynamic loader? (LD_DEBUG) Does loading gdal work if you execute it non-interactively? (E.g. ssh user@cluster 'python -c "import osgeo.gdal"')Spiller
It is indeed quite odd. The jobs are run with the same user name and the python executables are both the same (2.6.8). The shells might be different - they're both bash, but the interactive one seems to use /bin/bash whereas the queue uses /usr/local/bin/bash. However, I've tried using both and they both work fine interactively and both fail when used with the queue. I tried requesting an interactive session from the queue (using qsub -I), and that results in the same error (undefined symbol: H5Eset_auto2) as submitting the job.Godoy
Using ssh user@cluster 'export LD_LIBRARY_PATH=/mnt/aeropix/prgs/.local/lib ; python26 -c "from osgeo import gdal"' works fine thoughGodoy
Can you try putting the LD_LIBRARY_PATH variable export in your .bashrc and see if it affects the job? On another note, would you be able to paste the (full, ungrepped) logs from the dynamic loader? (use LD_DEBUG=all)Spiller
I put LD_LIBRARY_PATH in my .bashrc but there was no change. Here are the links to the full debug logs (they're quite large). ssh output: drive.google.com/file/d/0By6cKos4YGCGYjl3Z2twaGVTYVE/… qsub output: drive.google.com/file/d/0By6cKos4YGCGT2t5ZGpycGYzaU0/…Godoy
Can you also check if libtool is of the same version?Spiller
Can you run this both interactively and in the queue and look at the output? They ought to produce the exact same output - I can't imagine how they wouldn't be but I'm out of ideas :\ nm -D /mnt/aeropix/prgs/.local/lib/libhdf5.so.7 | grep H5Eset_auto2Spiller
The output of nm -D using ssh gives 0000003126a915b0 T H5Eset_auto2, and in the queue gives nothing at all.Godoy
Well this is bizarre, when I do libtool --version in the ssh terminal I get that it's version 1.5.22 (1.1220.2.365), while in the queue I get libtool: command not found. I tried doing whereis libtool and got /usr/bin/libtool for the ssh terminal, and nothing found when in the queue. Doing ls /usr/bin I find that they appear to be different for the ssh terminal and the queue - I am investigating this further.Godoy
Just be to sure, do an md5sum and see if the queue sees the exact same library file (libhdf5.so.7) as in interactive mode.Spiller
md5sum in interactive mode: 1560c4cf2f8bd0c5e64879f423f44bdf md5sum in queue: e01aad433d9d6ff2f8e48b07280d6442 they do appear to be different, although I've no idea how.Godoy
Interesting ... can you create a copy of libhdf5.so.7 somewhere else and see if the queue sees the same file? (If it does, try putting that new location in LD_LIBRARY_PATH instead.)Spiller
Solution found! Turns out libhdf5.so.7 was symlinked to a location that was apparently different in the queue. I modified the symlink and now everything seems to be working correctly. Would you like to answer the question? You did most of the work so I'd say you earned it.Godoy
That's good to hear! You've done your part too investigating this. (In hindsight this probably should've been the first thing to check.)Spiller
Ah yes, hindsight - I guess the next clue really should've been when we found that /usr/bin was different for the queue. For further reference, after being symlinked all over the place, libhdf5.so.7 ended up pointing to /usr/lib64/libhdf5.so.7.0.1. Thanks for your help, you're a champ.Godoy
Seems that gdal has mostly been a red herring here. One thing that puzzles me: how did that symlink get created? It seems like a rather bad idea!Spiller
Honestly, I have no idea any more. GDAL is painful at best to get working when you need to build it from scratch (which I had to, as I needed ECW support). It actually took me weeks to even get it to build as I ran into a plethora of problems including gcc 4.1.2 not playing nicely with the first version of the ECW library we got our hands on (and the associated 'hoop jumping' you need to do in order to request changes to installed software on the cluster). In the end we got an older ECW library and just built and installed it in a user area.Godoy
S
42

After two dozens of comments to understand the situation, it was found that the libhdf5.so.7 was actually a symlink (with several levels of indirection) to a file that was not shared between the queued processes and the interactive processes. This means even though the symlink itself lies on a shared filesystem, the contents of the file do not and as a result the process was seeing different versions of the library.

For future reference: other than checking LD_LIBRARY_PATH, it's always a good idea to check a library with nm -D to see if the symbols actually exist. In this case it was found that they do exist in interactive mode but not when run in the queue. A quick md5sum revealed that the files were actually different.

Spiller answered 24/11, 2014 at 23:30 Comment(4)
Also note, when following symlinks, remember to keep following them until you've reached the end.Godoy
The "For future reference" bit was great advice that helped me solve my problem. Amazing how specific this question is, yet given all the upvotes, how ubiquitously it applies.Barytes
e.g. nm -D /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1 | grep _glapi_tls_DispatchTernion
Mentioning LD_LIBRARY_PATH in your answer made me realise I was setting LD_LIBRARY_PATH in .bashrc. Commenting this part out this line in .bashrc fixed my issue (/usr/lib/git-core/git-remote-https: symbol lookup error: /lib/x86_64-linux-gnu/libp11-kit.so.0: undefined symbol: ffi_type_pointer, version LIBFFI_BASE_7.0 )Larch

© 2022 - 2024 — McMap. All rights reserved.