AI is increasingly popular, and Python automation is more accessible than ever. When you need to run tasks on a schedule, such as backing up data or crawling an API, Python's powerful scheduling framework APScheduler is a good choice.
Case: Client Countdown Program
At work, a Python app runs on the client side. It polls our backend for the latest task settings and saves them locally, then relies on APScheduler to execute offline tasks on schedule. When a task completes, the results are packaged and uploaded back to our backend.
The advantage of this design is that the backend stays clean and simple, since the scheduling logic is concentrated on the client side. The disadvantage is that once a problem occurs, you can only debug from the offline logs the customer provides, and often cannot obtain detailed information.
Issue
I recently ran into a strange problem: the app suddenly stopped contacting the backend and stopped uploading result files. The logs showed that the offline tasks were still executing steadily and results were still being packaged, so the scheduler appeared to be functional. The problem occurred on several clients, and once it happened, the only way to restore normal operation was to restart the app.
I initially suspected a connection issue with the backend, but after reading the logs I drew a blank. The main program is supposed to keep retrying, yet sometimes, after one successful connection, it never tried to reach the backend again (as if it vanished after a certain date?).
What went wrong?
- When the main program crashes and the main process has not managed its child processes/threads correctly, those children remain alive, causing unpredictable behavior.
Basically, you would expect that when the main program crashes, its sub-jobs abort with it. When you execute a Python script, the Python interpreter starts a main process to run the code, and all threads are created and managed by that process.
The APScheduler scheduler is likewise managed by the main process. When the main program crashes, the scheduler stops running, and in theory all sub-jobs should be forcibly terminated. But jobs that are already running keep going: in my case each subtask lasts N hours, which is exactly why new logs continued to be generated after the main program crashed.
Note: if you use the multiprocessing module, the main process spawns child processes to do the work. Each child process has its own memory space and Python interpreter, and is not restricted by the GIL. Even so, it is better to arrange for the main process to manage these child processes, as discussed later.
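A minimal sketch of that managed pattern (function and variable names here are invented for illustration, not the app's real code): the main process both creates the workers and waits for them, so no orphaned child is left behind.

```python
import multiprocessing as mp

def offline_task(n):
    # Stand-in for a long-running sub-job.
    return n * n

def run_managed():
    # The main process creates the workers AND waits for them; the
    # context manager joins the pool before returning, so nothing
    # outlives the parent.
    with mp.Pool(processes=2) as pool:
        return pool.map(offline_task, range(4))

if __name__ == "__main__":
    print(run_managed())  # [0, 1, 4, 9]
```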
Experiment
To reproduce the error, I simulated a small task and made the main program exit before the child process/thread finished.
Experiment 1 - ThreadPoolExecutor
For I/O-bound tasks, use ThreadPoolExecutor, which is also APScheduler's default executor.
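My original experiment code is not reproduced here, so the following is a hedged stand-alone sketch of the same effect (all names invented). A job submitted to a concurrent.futures ThreadPoolExecutor, which APScheduler's default executor is built on, keeps running even after the main thread dies with an exception; running the scenario as a subprocess lets us observe both the crash and the late output:

```python
import subprocess
import sys
import textwrap

# Hedged reproduction sketch (not the original experiment code): the
# child script crashes its main thread while a worker thread from a
# ThreadPoolExecutor is still busy.
child = textwrap.dedent("""
    import time
    from concurrent.futures import ThreadPoolExecutor

    def offline_task():
        time.sleep(0.5)
        print("offline task finished AFTER the crash", flush=True)

    ThreadPoolExecutor(max_workers=1).submit(offline_task)
    raise RuntimeError("simulated crash in the main program")
""")

result = subprocess.run([sys.executable, "-c", child],
                        capture_output=True, text=True)
print("exit code:", result.returncode)   # non-zero: the main thread crashed
print(result.stdout.strip())             # the worker still ran to completion
```

The interpreter joins the executor's worker threads at shutdown, so the task's final log line appears in stdout even though the process exits with an error.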
As you can see, the thread continues to write logs after the main program reports an error exit code.
Experiment 2 - ProcessPoolExecutor
Replacing the thread pool with a ProcessPoolExecutor gives the same result: the interpreter waits for the child process to finish executing.
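Again a hedged stand-alone sketch rather than the original code (names invented). The child script is written to a temporary file instead of being passed with -c, because spawn-based multiprocessing start methods need an importable main module:

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hedged reproduction sketch: the child script crashes its main thread
# while a ProcessPoolExecutor worker process is still busy.
child = textwrap.dedent("""
    import time
    from concurrent.futures import ProcessPoolExecutor

    def offline_task():
        time.sleep(0.5)
        print("offline task finished AFTER the crash", flush=True)

    if __name__ == "__main__":
        ProcessPoolExecutor(max_workers=1).submit(offline_task)
        raise RuntimeError("simulated crash in the main program")
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(child)
    path = f.name
try:
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True)
finally:
    os.unlink(path)

print("exit code:", result.returncode)   # non-zero: the main process crashed
print(result.stdout.strip())             # the pool worker still finished
```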
If you instead make the workers daemon processes, they are terminated automatically when the parent process exits, neither blocking it nor continuing to run. But then you get the error AssertionError: daemonic processes are not allowed to have children. Of course, this is not what we want either.
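That error is easy to reproduce in isolation with the multiprocessing module (illustrative names, not APScheduler-specific code): a daemonic process is forbidden from starting children of its own.

```python
import multiprocessing as mp

def grandchild():
    pass

def child():
    # This raises:
    # AssertionError: daemonic processes are not allowed to have children
    mp.Process(target=grandchild).start()

def demo():
    p = mp.Process(target=child, daemon=True)
    p.start()
    p.join()
    return p.exitcode  # non-zero, because child() died on the AssertionError

if __name__ == "__main__":
    print(demo())
```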
Conclusion: besides handling errors, what other remedies are there?
After I fixed the small bug that caused the main program to crash, the problem went away.
But thinking further: is there a better solution? Here are some common ways to make the system more resilient to main-program crashes:
- Set up a cron job or use systemd monitoring (Linux) to restart the main program once it crashes.
- Use Celery or Redis Queue (RQ) to completely separate the job from the main program’s scheduling. APScheduler is only responsible for scheduling, and execution is handed over to independent job managers or workers. Its additional benefit is that it supports persistence, and unfinished jobs can continue to execute after the main program is restarted.
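For the first option, a minimal systemd service sketch might look like this (the unit name and paths are made up for illustration):

```ini
# /etc/systemd/system/client-app.service  (hypothetical path and name)
[Unit]
Description=Client-side task app

[Service]
ExecStart=/usr/bin/python3 /opt/client-app/main.py
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

After systemctl enable --now client-app, systemd restarts the app whenever it exits with a failure code.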
In fact, our backend service that handles the uploaded files already uses RQ. Maybe I will write about it in the future.
