Merge pull request #709 from Michael-F-Bryan/speed

Speed
2026-06-05 23:00:18 +00:00 · 2016-06-04 11:33:05 -07:00
parent da6648f9bd dde23c230e
commit 3ec10e2c83
1 changed files with 206 additions and 2 deletions
@@ -226,13 +226,212 @@ Numba
 -----
 .. todo:: Write about Numba and the autojit compiler for NumPy
-Threading
+Concurrency
-:::::::::
+:::::::::::
 Concurrent.futures
 ------------------
 The `concurrent.futures`_ module is a module in the standard library that
 provides a "high-level interface for asynchronously executing callables". It
 abstracts away a lot of the more complicated details about using multiple
 threads or processes for concurrency, and allows the user to focus on 
 accomplishing the task at hand.
 The `concurrent.futures`_ module exposes two main classes, the
 `ThreadPoolExecutor` and the `ProcessPoolExecutor`. The ThreadPoolExecutor
 will create a pool of worker threads that a user can submit jobs to. These jobs
 will then be executed in another thread when the next worker thread becomes
 available.  
 The ProcessPoolExecutor works in the same way, except instead of using multiple
 threads for its workers, it will use multiple processes. This makes it possible
 to side-step the GIL, however because of the way things are passed to worker
 processes, only picklable objects can be executed and returned.
 Because of the way the GIL works, a good rule of thumb is to use a
 ThreadPoolExecutor when the task being executed involves a lot of blocking
 (i.e. making requests over the network) and to use a ProcessPoolExecutor
 executor when the task is computationally expensive.
 There are two main ways of executing things in parallel using the two
 Executors. One way is with the `map(func, iterables)` method. This works
 almost exactly like the builtin `map()` function, except it will execute
 everything in parallel. :
 .. code-block:: python
    from concurrent.futures import ThreadPoolExecutor
    import requests
    def get_webpage(url):
        page = requests.get(url)
        return page
    pool = ThreadPoolExecutor(max_workers=5)
    my_urls = ['http://google.com/']*10  # Create a list of urls
    for page in pool.map(get_webpage, my_urls):
        # Do something with the result
        print(page.text)
 For even more control, the `submit(func, *args, **kwargs)` method will schedule 
 a callable to be executed ( as `func(*args, **kwargs)`) and returns a `Future`_
 object that represents the execution of the callable.
 The Future object provides various methods that can be used to check on the
 progress of the scheduled callable. These include:
 cancel()
    Attempt to cancel the call.
 cancelled()
    Return True if the call was successfully cancelled.
 running()
    Return True if the call is currently being executed and cannot be
    cancelled.
 done()
    Return True if the call was successfully cancelled or finished running.
 result()
    Return the value returned by the call. Note that this call will block until
    the scheduled callable returns by default.
 exception()
    Return the exception raised by the call. If no exception was raised then
    this returns `None`. Note that this will block just like `result()`.
 add_done_callback(fn)
    Attach a callback function that will be executed (as `fn(future)`) when the
    scheduled callable returns.
 .. code-block:: python
    from concurrent.futures import ProcessPoolExecutor, as_completed
    def is_prime(n):
        if n % 2 == 0:
            return n, False
        sqrt_n = int(n**0.5)
        for i in range(3, sqrt_n + 1, 2):
            if n % i == 0:
                return n, False
        return n, True
    PRIMES = [
        112272535095293,
        112582705942171,
        112272535095293,
        115280095190773,
        115797848077099,
        1099726899285419]
    futures = []
    with ProcessPoolExecutor(max_workers=4) as pool:
        # Schedule the ProcessPoolExecutor to check if a number is prime
        # and add the returned Future to our list of futures
        for p in PRIMES:
            fut = pool.submit(is_prime, p)
            futures.append(fut)
    # As the jobs are completed, print out the results
    for number, result in as_completed(futures):
        if result:
            print("{} is prime".format(number))
        else:
            print("{} is not prime".format(number))
 The `concurrent.futures`_ module contains two helper functions for working with
 Futures. The `as_completed(futures)` function returns an iterator over the list
 of futures, yielding the futures as they complete.
 The `wait(futures)` function will simply block until all futures in the list of
 futures provided have completed.
 For more information, on using the `concurrent.futures`_ module, consult the
 official documentation.
 Threading
 ---------
 The standard library comes with a `threading`_ module that allows a user to
 work with multiple threads manually.
 Running a function in another thread is as simple as passing a callable and
 it's arguments to `Thread`'s constructor and then calling `start()`:
 .. code-block:: python
    from threading import Thread
    import requests
    def get_webpage(url):
        page = requests.get(url)
        return page
    some_thread = Thread(get_webpage, 'http://google.com/')
    some_thread.start()
 To wait until the thread has terminated, call `join()`:
 .. code-block:: python
    some_thread.join()
 After calling `join()`, it is always a good idea to check whether the thread is
 still alive (because the join call timed out):
 .. code-block:: python
    if some_thread.is_alive():
        print("join() must have timed out.")
    else:
        print("Our thread has terminated.")
 Because multiple threads have access to the same section of memory, sometimes
 there might be situations where two or more threads are trying to write to the
 same resource at the same time or where the output is dependent on the sequence
 or timing of certain events. This is called a `data race`_ or race condition. 
 When this happens, the output will be garbled or you may encounter problems
 which are difficult to debug. A good example is this `stackoverflow post`_.  
 The way this can be avoided is by using a `Lock`_ that each thread needs to
 acquire before writing to a shared resource. Locks can be acquired and released
 through either the contextmanager protocol (`with` statement), or by using
 `acquire()` and `release()` directly. Here is a (rather contrived) example:
 .. code-block:: python
    from threading import Lock, Thread
    file_lock = Lock()
    def log(msg):
        with file_lock:
            open('website_changes.log', 'w') as f:
                f.write(changes)
    def monitor_website(some_website):
        """
        Monitor a website and then if there are any changes, 
        log them to disk.
        """
        while True:
            changes = check_for_changes(some_website)
            if changes:
                log(changes)
    websites = ['http://google.com/', ... ]
    for website in websites:
        t = Thread(monitor_website, website)
        t.start()
 Here, we have a bunch of threads checking for changes on a list of sites and
 whenever there are any changes, they attempt to write those changes to a file
 by calling `log(changes)`. When `log()` is called, it will wait to acquire
 the lock with `with file_lock:`. This ensures that at any one time, only one
 thread is writing to the file. 
 Spawning Processes
 ------------------
@@ -248,3 +447,8 @@ Multiprocessing
 .. _`New GIL`: http://www.dabeaz.com/python/NewGIL.pdf
 .. _`Special care`: http://docs.python.org/c-api/init.html#threads
 .. _`David Beazley's`: http://www.dabeaz.com/GIL/gilvis/measure2.py
 .. _`concurrent.futures`: https://docs.python.org/3/library/concurrent.futures.html
 .. _`Future`: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future
 .. _`threading`: https://docs.python.org/3/library/threading.html
 .. _`stackoverflow post`: http://stackoverflow.com/questions/26688424/python-threads-are-printing-at-the-same-time-messing-up-the-text-output
 .. _`data race`: https://en.wikipedia.org/wiki/Race_condition