Watch your permissions

For the last month I have been plagued by an issue on one of my client’s production APIs that had me thoroughly stumped.

The Problem

I would wake up to an onslaught of New Relic notifications on my phone telling me that the monitors I had set up were failing and the API was unavailable. (We check a private API method every minute to make sure the API is available and everything is running correctly.)

The errors were the same every single morning: the Composer autoload file would fail to load the dependencies because of file permissions, causing fatal errors. Fixing them was trivial, and every morning I would run the same set of commands to bring the API back up:

rm -rf vendor/
composer install
chmod -R 0777 path/to/assets

Simple enough to fix, but annoying enough to cost me sleep because of the effect it had on the live system.

The Cause

I spent days searching Stack Overflow, GitHub and Google for a similar issue, and came up with nothing. I tried bleeding-edge versions of Composer and rolled back to earlier versions, but the problem remained.

It was difficult to pinpoint what was at fault. Reinstalling Composer had no effect. Deleting all the composer.lock files and reinstalling the dependencies changed nothing.

My instincts told me it had to be related to a cron job, because it happened around the same time every morning (between 3:10am and 3:45am).

But…

I had checked the crontab over and over. I had even gone as far as manually running each script in the main crontab and checking the logs and New Relic to see whether the error cropped up, but nothing.

I checked the crontab for every user on the server: no jobs were scheduled for anyone except root, and those were not at fault.
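Checking every user’s crontab by hand gets tedious. A minimal sketch of a loop that does it, assuming a cronie/Vixie-style `crontab` that accepts `-u USER -l` and that it is run as root (other users’ crontabs are not readable otherwise). The function name `list_all_crontabs` and its optional passwd-file argument are hypothetical, the latter added only so the loop can be pointed at a test file.

```shell
#!/bin/sh
# Hypothetical helper: print every user's crontab, one section per account.
# PASSWD_FILE defaults to /etc/passwd; run as root so `crontab -u` is allowed.
list_all_crontabs() {
  passwd_file="${1:-/etc/passwd}"
  # Only the first colon-separated field (the user name) is needed.
  while IFS=: read -r user _rest; do
    [ -n "$user" ] || continue
    printf '== %s\n' "$user"
    # Users without a crontab make `crontab -l` complain; hide that noise.
    crontab -u "$user" -l 2>/dev/null </dev/null || true
  done < "$passwd_file"
}
```

Run as `list_all_crontabs` on the server. Note that this only covers per-user crontabs, so it would not have revealed the job behind this particular bug.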

Because the fault’s start time varied, I started to doubt whether it was related to cron at all, so last night (or rather, early this morning) I decided to get rid of this bug once and for all. I opened multiple SSH sessions and monitored each window, running various commands (top, ps aux via a custom script, and iotop) while watching the process stack on New Relic and the logs.

It was like a scene out of the Matrix, except on one screen.
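The custom ps script mentioned above isn’t shown. A minimal sketch of that kind of watcher, assuming procps `ps` (for `-eo` and `--sort`): it appends a timestamped snapshot of the busiest processes to a log file every few seconds, so a short-lived command is still on record afterwards. The function name, arguments and defaults are hypothetical.

```shell
#!/bin/sh
# Hypothetical watcher: append timestamped snapshots of the top CPU consumers
# to LOGFILE, COUNT times, INTERVAL seconds apart.
snapshot_procs() {
  logfile="$1"
  count="${2:-720}"      # 720 x 5 s = one hour of snapshots by default
  interval="${3:-5}"
  i=0
  while [ "$i" -lt "$count" ]; do
    {
      printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
      # Full command lines, highest CPU first; keep the header plus 15 rows.
      ps -eo pid,user,pcpu,etime,args --sort=-pcpu | head -n 16
    } >> "$logfile"
    i=$((i + 1))
    sleep "$interval"
  done
}
```

Started in the background shortly before the problem window (for example `snapshot_procs /root/proc-watch.log 720 5 &`, with a placeholder log path), it leaves a record that can be searched for the minute the alerts fired.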

At 3:40am I started to think that something I had done over the course of the day had worked, because the error hadn’t cropped up. Then, just as I was about to call it a night, my phone went nuts and top started kicking up a fuss. top refreshes every 3 seconds by default, so I was lucky enough to catch the command that was running right as the errors started.

The command: “find”

Well, that didn’t help much, did it? How could find have any effect on Composer dependencies? It is a search tool.

Baffled, I went back to New Relic hoping it would shed some light on the situation. Alas, it showed nothing conclusive, though it did help me in finding the issue.

I logged back into one of the SSH sessions I had opened and checked the logs directory. There had to be something logged somewhere that could help me, so I ran a quick command:

ls -alt /var/log

Looking through the listing of the log files in date order, I found a breadcrumb: the logrotate script, which compressed the log files daily, was running at the exact time of the errors. The timestamp on each compressed log file matched, to the minute, the time the errors started on the API.

So it was cron related, but where was it coming from? It wasn’t in the crontab, so I dug a bit more. I checked the cPanel WHM scheduler (my client likes cPanel and manages his own hosting, so I didn’t have a say in this) and nothing showed up. But if logrotate was running at the same time as the errors, it had to be on a schedule of some sort, so I went a little deeper, and then I found it.

Anacron

There are three cron folders on the server: cron.daily, cron.weekly and cron.monthly. I checked the anacron config file and found that the scripts in the cron.daily folder were run every day between 3:10 and 3:45. So I had found the culprit, and a quick scan of the scripts in that folder turned up the mastermind behind it all: a custom script, written by the hosting company, that reset the permissions on all PHP files.
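The takeaway is that `crontab -l` only shows part of the picture: jobs can also live in the system-wide /etc/crontab, in /etc/cron.d, and in the cron.hourly/daily/weekly/monthly folders that cron or anacron typically runs via run-parts. A minimal sketch that lists all of those at once, assuming the usual Linux locations under /etc; the function name and the optional ROOT argument (useful for testing against a copy of /etc) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list every file that can schedule jobs system-wide,
# not just per-user crontabs. ROOT defaults to "" (the real filesystem).
list_cron_sources() {
  root="${1:-}"
  for path in "$root/etc/crontab" "$root/etc/anacrontab"; do
    if [ -f "$path" ]; then printf '%s\n' "$path"; fi
  done
  for dir in cron.d cron.hourly cron.daily cron.weekly cron.monthly; do
    # An unmatched glob stays literal, so the -f test also filters that out.
    for job in "$root/etc/$dir"/*; do
      if [ -f "$job" ]; then printf '%s\n' "$job"; fi
    done
  done
}
```

On RHEL/CentOS-style systems, `grep -E 'START_HOURS_RANGE|RANDOM_DELAY' /etc/anacrontab` shows the window and random delay anacron applies, which would explain a start time that drifts within a range rather than firing at a fixed minute.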

In short, it changed the owner of any file owned by “nobody” to the account’s own user, stripped all group and world permissions from PHP files that were group- or world-readable or writable, and removed group and world write permission from directories.

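# Give files owned by "nobody" back to each account's user.
# (The input that feeds user names into this loop isn't shown here.)
while read USER_NAME; do
  nice find /home/"$USER_NAME"/public_html -user nobody -print0 | nice xargs -0 chown "$USER_NAME"
done

# Strip all group/world permissions from PHP files that have any group/world read or write bit.
# The trailing & runs this step in the background.
nice find /home/*/public_html -type f -iname '*.php*' -perm /066 -print0 | nice xargs -0 chmod go-rwx &

# Remove group/world write permission from directories.
nice find /home/*/public_html -type d -perm /022 -print0 | nice xargs -0 chmod go-w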
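To see why this bites, it helps to run the PHP rule against a throwaway tree. A minimal sketch, assuming GNU find (for `-perm /066`) and GNU stat (for `-c`); the demo paths are hypothetical. The file keeps its owner but ends up with mode 600, so any other account, such as a PHP process running as the site user, can no longer read it.

```shell
#!/bin/sh
# Sketch: apply the host script's PHP rule to a throwaway directory tree.
demo_dir=$(mktemp -d)
php_file="$demo_dir/public_html/vendor/autoload.php"
mkdir -p "$demo_dir/public_html/vendor"
printf '<?php\n' > "$php_file"
chmod 644 "$php_file"                        # typical world-readable PHP file

echo "before: $(stat -c '%a' "$php_file")"   # before: 644

# Same selection and chmod as the cron.daily script, limited to the demo tree.
find "$demo_dir/public_html" -type f -iname '*.php*' -perm /066 -print0 \
  | xargs -0 chmod go-rwx

echo "after:  $(stat -c '%a' "$php_file")"   # after:  600
```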

Now, this script makes sense: I assume the hosting company put it there as a preventive measure against attacks, to close unwanted security holes.

The problem, though, is that it reset the permissions on all PHP files, including those owned by root (the user Composer runs under). Because the rest of the API’s files are owned by a different user, the account the API runs as could no longer read the root-owned dependencies, which caused both the permissions issues and the ownership issues.
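To check a site for this kind of damage after the nightly run, one can look for PHP files that the runtime account neither owns nor can read through the world bits. A minimal sketch, assuming GNU find; the function name is hypothetical, the account is passed as a numeric UID, and group membership is deliberately ignored to keep it short (a listed file could still be readable through its group).

```shell
#!/bin/sh
# Hypothetical diagnostic: list PHP files under DIR that are not owned by
# RUNTIME_UID and have lost their world-read bit, i.e. files the runtime
# account can only read if it happens to be in the file's group.
find_unreadable_php() {
  dir="$1"
  runtime_uid="$2"
  find "$dir" -type f -iname '*.php*' ! -uid "$runtime_uid" ! -perm -004 -print
}
```

For example, `find_unreadable_php /home/apiuser/public_html "$(id -u apiuser)"` (with `apiuser` as a placeholder account) would list root-owned vendor files that the nightly job had locked down.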

The solution was simple enough: I removed the script from the cron.daily folder, and I will be speaking with the hosting company about a better version that can run without affecting root-owned files. I also have other monitoring in place, so this script isn’t really needed in this case.
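What a better script looks like is up to the hosting company, but one possible adjustment is to keep the same PHP rule while skipping files owned by root, so vendor/ trees installed as root keep their read bits. A hypothetical sketch, assuming GNU find and GNU xargs (`-r` stops chmod from running with no arguments when nothing matches); the skip UID is a parameter only so the function can be tested without root.

```shell
#!/bin/sh
# Hypothetical revision of the host's PHP permission rule: same selection,
# except files owned by SKIP_UID (root, UID 0, by default) are left alone.
reset_php_perms() {
  base="$1"
  skip_uid="${2:-0}"
  # -r (GNU xargs) skips chmod entirely when find matches nothing.
  find "$base" -type f -iname '*.php*' -perm /066 ! -uid "$skip_uid" -print0 \
    | xargs -0 -r chmod go-rwx
}

# Example against the original glob (run as root, as the cron job does):
#   for dir in /home/*/public_html; do reset_php_perms "$dir"; done
```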

Footnotes

It’s worth noting that when the PHP files were owned by the same user as the other files in the domain, the permissions issue didn’t crop up. But every time I run composer update, the ownership is reset, because I log in as root.
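A stopgap for the ownership side, short of running Composer as the site user, is to hand vendor/ back to that user after each root-run install or update. A hypothetical sketch; the function name, path and account are placeholders, and the `user:` form of chown (owner plus that user’s login group) is GNU syntax.

```shell
#!/bin/sh
# Hypothetical post-install step: after Composer runs as root, give the
# project's vendor/ directory back to the account that owns the rest of the site.
fix_vendor_owner() {
  project_dir="$1"
  site_user="$2"
  # "user:" = set owner to site_user and group to that user's login group.
  chown -R "$site_user": "$project_dir/vendor"
}

# Example (placeholder path and account):
#   composer install --working-dir=/home/apiuser/public_html/api
#   fix_vendor_owner /home/apiuser/public_html/api apiuser
```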

Yes, I know, we shouldn’t use root as the main user. All my own servers, and the client servers I manage, have root login disabled, but the hosting company my client uses provides its servers that way, and it is a requirement for how they get access when they need to.

Yes, I could give each domain user SSH access and run the tasks under sudo, which is the best and safest route, but there are only a few sites on this server and I am the only administrator, so this works.

I also prefer to run programs like Composer globally, as we use it at various levels across the server(s).
