    On Environment/Package Management in Python

    Yuxin Wu (ppwwyyxxc@gmail.com), posted on 2019-12-23 14:38:00

    Python's package management is a mess. I'm involved in a few open source projects and I often help users address their environment & installation issues. A large number of these environment issues essentially come down to incorrectly / accidentally mixing multiple different python environments together. This post lists a few common pitfalls and misconceptions of this kind.

    The Problems

    Unfortunately, people often have multiple python binaries and multiple installations of python packages, e.g.:

    1. The OS's package manager can install python and some python packages.
      • Example: /usr/bin/python, /usr/lib/python3.7/*
    2. Users can use setup.py or system's pip install to install new packages to different locations.
      • Example: /usr/local/lib/python3.7/*, $HOME/.local/lib/python3.7/*
    3. Running pip install etc. under a virtualenv installs to a location under that virtualenv.
      • Example: $HOME/my_venv/bin/python, $HOME/my_venv/lib/python3.7/*
    4. Anaconda users will install python packages to its own location.
      • Example: $HOME/anaconda3/bin/python, $HOME/anaconda3/lib/python3.7/*

    All of the above methods of installing a library are very common. As a result, many python developers' machines have multiple environments. A ton of problems can arise from this.
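To see which of these environments a given interpreter actually belongs to, a small report like the one below can help. It is only a sketch using the standard library; the function name describe_environment is made up for illustration.

```python
import site
import sys
import sysconfig


def describe_environment():
    """Collect the facts that identify the running python environment."""
    return {
        "executable": sys.executable,                        # the python binary in use
        "prefix": sys.prefix,                                # root of this environment
        "site_packages": sysconfig.get_paths()["purelib"],   # usual install location
        "user_site": site.getusersitepackages(),             # e.g. under $HOME/.local
        "search_path": list(sys.path),                       # where imports are looked up
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it with each python binary on the machine (system, venv, anaconda) shows how many distinct environments exist and where each one looks for packages.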

    Be careful of multiple installations of the same package

    For the reasons above, you could have multiple installations of the same package in your system. This often causes very confusing issues if you think you're using one installation, but are actually using a different one. Examples of such issues include:

    1. You install a package of the desired version but still see complaints about a wrong package version, or run into bugs that exist in the wrong version
    2. You build & install a package with your custom changes but they have no effect
    3. You attempt to fix a bug by changing the source code, but you're in fact running another installation of the package so the bug never appears to be fixed

    When such issues appear, remember to verify which library you're using and where it is. When in doubt, try the following methods:

    1. Use import lib; print(lib.__version__) to know the version of the library you're using. However, not all packages have the __version__ attribute. It could also be VERSION, etc.

    2. Use import lib; print(lib.__file__) to know the location of the library you're using. This method should work for most packages.

    3. Use strace -fe file python -c 'import lib; do_something_with_lib()' to see every file used by the command. This tells you everything needed to figure out whether you have the issue of multiple installations.

    I have the following shell function to help me check libraries:

    pylibinfo () {
    python -c "import $1 as X; print(X.__file__, end=' '); print(X.__version__)"
    }
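The same check can be written in Python so that it does not fail when a package lacks __version__. This is a sketch assuming Python 3.8+ (for importlib.metadata); lib_info is a made-up name, and the metadata fallback assumes the distribution is named like the module, which is not always the case.

```python
import importlib
from importlib import metadata


def lib_info(module_name):
    """Return (file, version) for the module that `import module_name` resolves to."""
    module = importlib.import_module(module_name)
    location = getattr(module, "__file__", None)  # None for built-in modules
    version = getattr(module, "__version__", None)
    if version is None:
        # Fall back to installed distribution metadata. This assumes the
        # distribution name equals the module name, which is not always true.
        try:
            version = metadata.version(module_name)
        except metadata.PackageNotFoundError:
            version = None
    return location, version
```

The location comes from the module that was actually imported, so it is the more trustworthy of the two values when several installations coexist.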

    Don't use pip list or conda list to check package version

    The version you see in these two commands may not match what you're actually using, because there could be multiple versions of the same library in the system installed by pip or conda or other methods. Neither pip nor conda is able to know all of them.

    To tell precisely the version of a library you're using, follow the suggestions above.
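One way to spot the situation described here is to list distributions that show up more than once on the import path of the running interpreter. The sketch below assumes Python 3.8+; find_duplicate_distributions is a made-up name, and it only sees installations that record metadata (for example, it will not notice a raw source directory).

```python
from collections import defaultdict
from importlib import metadata


def find_duplicate_distributions():
    """Map each distribution name found more than once on sys.path to its versions."""
    seen = defaultdict(list)
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip metadata directories without a usable name
            seen[name.lower()].append(dist.version)
    return {name: versions for name, versions in seen.items() if len(versions) > 1}


if __name__ == "__main__":
    for name, versions in sorted(find_duplicate_distributions().items()):
        print(name, versions)
```

An empty result only means this interpreter sees no duplicates; other python binaries on the same machine have their own search paths.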

    Don't use setup.py install to install packages

    Usually, a package installed in this way is not managed by any system: no command can tell you it is installed; no command can uninstall it for you. A pip uninstall for such packages may complain that it "cannot determine which files belong to it", or it may just do nothing. You often need to manually remove files to really uninstall it.

    The result is that, when you need to install a different version of it some day in the future using other methods (e.g. pip or conda), it either fails, or succeeds but gives you a system with multiple installations.

    Prefer python -m pip over pip

    There could be multiple python binaries in your system (e.g., from system, venv, anaconda). However, pip is just a python script: based on how its shebang line is written, some versions of pip pick the python executable from your $PATH, but some versions of pip have a hard-coded absolute path to the python executable they will use.

    As a result, when you run pip install directly, it's not immediately clear which python it will use, let alone where the library will be installed.

    In an environment with more than one python, always use python -m pip or /some/python -m pip, instead of the pip command directly.
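The same rule applies when a script needs to call pip: bind the call to a specific interpreter instead of relying on whatever pip is on $PATH. A minimal sketch, where pip_command and pip_version are made-up helper names:

```python
import subprocess
import sys


def pip_command(*args, python=sys.executable):
    """Build a pip command line bound to one specific python executable."""
    return [python, "-m", "pip", *args]


def pip_version(python=sys.executable):
    """Ask that interpreter's pip to describe itself, including where it lives."""
    result = subprocess.run(
        pip_command("--version", python=python),
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(pip_command("install", "--user", "some_package"))  # placeholder package name
    print(pip_version())
```

The output of pip --version includes the directory pip was loaded from and the python version it runs under, which makes a mismatch easy to see.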

    Do pip uninstall multiple times

    If you want to uninstall something, uninstall it multiple times until it converges. pip can install one package multiple times in different locations (e.g., one inside virtualenv/conda + one in $HOME/.local).

    Use python -c 'import lib' to confirm uninstallation

    Not everything can be uninstalled with a simple pip uninstall or conda uninstall. Examples are:

    1. Libraries you installed to a different prefix with a different pip.
    2. Libraries installed by the distro or libraries that are installed with setup.py install.
    3. Libraries in your PYTHONPATH.
    4. import lib may be provided by multiple alternative packages. For example, the tf-nightly and tensorflow packages both provide import tensorflow. It's easy to forget if you've installed both.

    As a result, always use import lib to confirm after you uninstall something. If you're surprised by a successful import, use the methods in this article to find where the library comes from.
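The confirmation step can also be done without executing the library, by asking the import system what it would load. This is a sketch using importlib.util.find_spec; where_is is a made-up name.

```python
import importlib.util


def where_is(module_name):
    """Return the path `import module_name` would load, or None if not importable."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    if spec.origin is not None:
        return spec.origin
    # Namespace packages have no single file; report their directories instead.
    return ", ".join(spec.submodule_search_locations or [])
```

After an uninstall, where_is("lib") should return None. If it returns a path instead, that path is the leftover installation to remove.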

    Be careful when declaring dependencies on large packages

    Large, complicated dependencies such as OpenCV, PyTorch, TensorFlow can often be installed in many different ways, only some of which are valid for a given environment. Such dependencies should NOT be declared in setup.py / requirements.txt to be automatically installed. To avoid invalid installations or multiple installations, the choice of how to install these dependencies should be left to users.

    Unfortunately, 10k+ projects declare opencv-python as a dependency. As a result, their users will automatically install and use the desktop version opencv-python, instead of:

    1. the contrib version opencv-contrib-python, with more features
    2. the headless version opencv-python-headless, with fewer features and fewer compatibility issues
    3. Linux distro's own package, with fewer compatibility issues

    In fact, opencv-python has given suggestions on how to select the right package. "Automatic" selection is simply wrong. Similarly, a project that declares a dependency on PyTorch may automatically install one with a mismatched CUDA version.
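One possible way to leave the choice to users, without declaring the dependency, is to check for it at import time and fail with instructions. This is only a sketch of that idea, not something the projects mentioned above do; require_module and the hint text are made up for illustration.

```python
import importlib


def require_module(module_name, install_hint):
    """Import a heavy dependency, or fail with instructions instead of auto-installing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required but not installed. {install_hint}"
        ) from exc


# Hypothetical usage inside a project that needs OpenCV:
# cv2 = require_module(
#     "cv2",
#     "Install ONE of: opencv-python, opencv-contrib-python, "
#     "opencv-python-headless, or your distro's OpenCV package.",
# )
```

The trade-off is that installation is no longer a single command, but the user ends up with exactly one build that matches their environment.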

    Be careful when using a library in the root of its source

    You can sometimes have a python library installed already, but you also have its raw source code somewhere in your system. This is another potential case of multiple installations.

    If you execute import libA in the source directory, python may find a local directory called libA which contains the source code, and use this source code, rather than the libA that's actually installed in a different location.

    In addition to the common confusions that can arise from multiple installations, such situations often cause errors, because the source code is often an invalid installation itself. In many libraries, the raw source code is different from what actually gets installed after you run pip install. The most common example is that compiled extensions will not exist in the source code. As a result, using a python library in its source code directory often leads to errors.

    The issue is so common that some libraries (e.g., numpy and tensorflow) try to detect it and educate their users about it.

    Situations where it is OK to use a source directory include:

    • Simple libraries where the source code is equal to what gets installed
    • Libraries that can be, and have been, installed locally inside the source directory, usually with pip install --editable.
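To check whether an import would pick up a directory under the current working directory, the resolved location can be compared against it. This is a heuristic sketch (shadowed_by_cwd is a made-up name): it also returns True for legitimate cases such as an editable install or a virtualenv created inside the current directory.

```python
import importlib.util
import os


def shadowed_by_cwd(module_name, cwd=None):
    """Return True if `import module_name` would resolve to a path under cwd."""
    cwd = os.path.realpath(cwd or os.getcwd())
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return False
    if spec.has_location:
        candidates = [spec.origin]
    else:
        # Built-in modules have no path; namespace packages have directories.
        candidates = list(spec.submodule_search_locations or [])
    for path in candidates:
        try:
            if os.path.commonpath([cwd, os.path.realpath(path)]) == cwd:
                return True
        except ValueError:  # e.g. paths on different drives on Windows
            continue
    return False
```

If this returns True for a library that is supposed to come from site-packages, running the same command from another directory is a quick way to confirm the diagnosis.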

    Never use sudo to install python packages

    Never use sudo pip install or sudo python setup.py, unless it's a virtual system (e.g. docker) that you don't intend to keep long, because:

    1. It is yet another installation. For example, you can have one version installed with root and one without, causing more trouble.
    2. When you do an installation in the future in the right way (without root), this old package cannot be automatically upgraded.
    3. It affects all users, causing the "multiple installation" problem for them as well.

    pip install --user can install libraries without root permission (installed to $HOME/.local on linux). This option is sometimes the default in recent versions of pip. Or you can use venv if stronger isolation is needed. venv is now officially part of Python 3.
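Whether the current interpreter is inside a venv, and whether the per-user directory is in play, can be checked from Python itself. A small sketch with made-up function names:

```python
import site
import sys


def in_virtualenv():
    """True inside a venv/virtualenv: the prefix differs from the base installation."""
    # Old versions of the third-party virtualenv tool set sys.real_prefix instead.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def isolation_report():
    """Summarize how isolated this interpreter is from other installations."""
    return {
        "in_virtualenv": in_virtualenv(),
        "user_site_enabled": bool(site.ENABLE_USER_SITE),  # is $HOME/.local used?
        "user_site": site.getusersitepackages(),
    }


if __name__ == "__main__":
    print(isolation_report())
```

When user_site_enabled is True, packages installed with --user are importable from this interpreter in addition to its own site-packages.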

    You don't need root permission for most installations

    You only need root permission when the library directly interacts with hardware, e.g., you need root permission to install the nvidia driver.

    You do not need root permission to, e.g., install a different version of Python, GCC, or CUDA (though a newer CUDA sometimes requires a newer driver). But doing these without root permission certainly requires some extra knowledge.

    Avoid mixing binaries built from different sources

    Python itself is a binary that depends on some other binary libraries. Each python package may also contain binaries or depend on other binary libraries. Mixing binaries built from different sources (e.g. your system package manager vs. anaconda) together (i.e. into a single process) has potential binary compatibility issues.

    Such issues can happen when you want to use libA and libB together, but they are built using different versions of another library libC, or built with different C++ compilers. (A C compiler, however, should produce binary compatible code across compiler versions.)

    Ideally you might expect some mechanism to avoid such conflicts. There is indeed a complicated set of symbol visibility & compiler ABI rules, but most libraries do not follow them correctly. The result of such incompatibility issues is often a segfault or other mysterious errors.

    In reality, here is how packages are built:

    • Your OS's package manager (apt/yum/pacman, etc.) installs many binaries and libs. They are built with the exact system packages they depend on, using the exact compiler installed by the package manager. They are all built in a nice uniform environment that will not have any compatibility issues: all these packages can be mixed together.

    • When you pip install a package, there are two possibilities:

      • Source distribution: this command compiles source code, using whatever compiler & dependency libraries it finds. So its compatibility will depend on which compiler & libraries it finds. Typically this is controlled by standard environment variables such as $CC, $LIBRARY_PATH, but it varies among packages.

      • Binary wheel distribution: this command downloads a pre-built binary. This means that you need to confirm the binary is built in an environment that's compatible with other packages you're using.

        Lots of binary packages on pypi contain the word "manylinux": it means the package is built such that it's supposed to be compatible with most linux environments. Typically, using a manylinux package should not lead to compatibility issues, although there are exceptions (e.g., some packages incorrectly mark themselves as manylinux). Also, a manylinux package may have suboptimal performance due to the compatibility requirements: such packages are often built with old compiler versions and old instruction sets.

        For other packages without the "manylinux" signature, you can only wish for good luck. They usually work fine but could stop working any day. There are a number of github issues in different projects about "import libA causes import libB to crash". Typically these are giant projects, such as OpenCV, TensorFlow, PyTorch.

    • When you conda install a package that contains binaries, it's always pre-built. The official packages are built in anaconda's standard environment, and all the runtime dependencies in that standard environment are also packaged and distributed by anaconda. Anaconda provides an (almost) full runtime environment, including essential libs such as libstdc++ and libgcc. This means that the conda world is just like your OS's package manager: if you use conda to install all libraries (and their dependencies), they are always compatible with each other.

      That sounds nice, until you want to build a package by yourself. Anaconda provides a full runtime environment, but usually not the build-time environment. Normally you'll still be building the package using your system's compiler & libraries (or those defined by your envvars).

      As long as you use python from conda, you'll almost always run inside conda's runtime environment, using libstdc++, libjpeg, etc. from anaconda/lib. It's then possible that the package you build is not compatible with conda's runtime environment.

      I've frequently seen such failures, e.g.:

      • Build a package using the system's gcc. Then it cannot run inside conda's runtime since the runtime is built with an old version of gcc.
      • conda install cudatoolkit=10.1 pytorch gives you a working pytorch in a cuda10.1 runtime. It works fine until you build a custom cuda extension: the extension will use nvcc from your system, which may not be 10.1.

      That's why I personally avoid conda and use system's python whenever possible.
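Before building a package yourself, it can help to know whether the python you are about to build against is a conda one. The sketch below relies on a heuristic, namely that conda environments keep their package records in a conda-meta directory under the prefix; is_conda_prefix is a made-up name.

```python
import os
import sys


def is_conda_prefix(prefix=sys.prefix):
    """Heuristic: conda environments keep package records in <prefix>/conda-meta."""
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


if __name__ == "__main__":
    kind = "conda" if is_conda_prefix() else "non-conda"
    print(f"{sys.executable} looks like a {kind} python")
```

If it reports a conda python, binaries built with the system's compiler will run inside conda's runtime, which is the mismatch described above.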


