An Old Topic: Reading Oversized Files with PHP

Tags: large file, PHP

As a PHP developer who spends most of the year on plain CRUD work, memory is simply not something you have to think about: Apache or PHP-FPM manages the processes for us, and whatever a request allocates is torn down as soon as the request finishes, so memory problems basically never come up.

However, some people insist on turning exactly this into an interview question. There is always some unruly interviewer who asks you how to "read a 10 GB file with PHP". Of course, for an ordinary fool like me, the first moment you hear the question you go blank, the second moment you curse under your breath, and the third moment you just keep stuttering.

"Interview for building rockets, get hired to tighten screws," as the saying goes. Still, it is only a matter of time before even the newcomer who was hired to tighten screws is expected to have an opinion on "reading a 10 GB file with PHP". So, in order to keep tightening screws here, the problem of reading a 10 GB file has to be solved first.

To read a 10 GB file, first of all, you have to have a 10 GB file.

… …

Getting one is actually quite simple. Just grab any nginx log file, even one that is only 10 KB. Assuming it is named test.log, run "cat test.log >> test.log": the command keeps appending the file to itself, so it grows very quickly. Listen to me, young man, press Ctrl+C after roughly 30 seconds. Here, you can get a feel for it:

A 202 MB file is more than enough for a demonstration; there is no real need to build a full 10 GB one.
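Incidentally, if you would rather not lean on cat, a small PHP sketch can do the same job; the seed file name and the roughly 200 MB target below are just example values:

<?php
// Inflate a small seed log into a large test file by appending it repeatedly.
$seed    = file_get_contents( './seed.log' );   // any small log file will do
$target  = 200 * 1024 * 1024;                   // ~200 MB; adjust as needed
$written = 0;
$fp = fopen( './test.log', 'w' );
while( $written < $target ){
  $written += fwrite( $fp, $seed );
}
fclose( $fp );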

First, let's brute-force it with PHP's file() function and see what happens:

<?php
$begin = microtime( true );
file( './test.log' );
$end = microtime( true );
echo "cost : ".( $end - $begin ).PHP_EOL;

Save it as test.php, and then execute it at the command line. The result is shown in the following figure:

Out of the box, PHP caps each process at 128 MB of memory (the memory_limit setting), yet we are asking it to hold more than 202 MB at once. So, let's go and edit the PHP configuration file ...

Don't be shy: change this parameter to 1024 MB and then run the PHP script above again.
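For reference, the directive to change in php.ini is just this one line (the exact path to php.ini depends on your installation):

; php.ini
memory_limit = 1024M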

Next, let's try our old favourite, file_get_contents().
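The timing script barely changes; a minimal sketch, assuming the same test.log, with file_get_contents() swapped in for file():

<?php
$begin = microtime( true );
// Unlike file(), file_get_contents() returns the whole file as a single string.
file_get_contents( './test.log' );
$end = microtime( true );
echo "cost : ".( $end - $begin ).PHP_EOL;

The result is as follows: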

Both approaches load the entire file into memory in one go: file() stores each line as an element of a PHP array, while file_get_contents() returns the whole file as a single string. My machine has 10 GB of RAM and a 256 GB SSD; loading the 202 MB file took about 0.67 seconds with file() and about 0.25 seconds with file_get_contents() (so file_get_contents() seems the better of the two). Still, we could only read the 202 MB file because we were allowed to raise the configuration. What if a 100 GB file is sitting in front of us? Or, put differently, what if the system only grants PHP 20 MB of memory and you are not allowed to change it?

Our real focus is how to read files hundreds of times larger than the available memory on a machine where memory is tight. So next, let's set memory_limit to 16M and turn on hard mode.
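A convenient way to do that without touching php.ini again is PHP's -d switch, which overrides an ini directive for a single run:

php -d memory_limit=16M test.php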

So: a 202 MB file, with only 16 MB of memory to work with. The overall idea is actually very simple: read it bit by bit. As long as each individual read stays well under 16 MB, there will be no problem. First, let's get a feel for reading one character at a time. The guest on stage is the fgetc() function:

<?php
$begin = microtime( true );
$fp = fopen( './test.log', 'r' );
while( false !== ( $ch = fgetc( $fp ) ) ){
  // ⚠️⚠️⚠️ To check that the code really reads the file, you can uncomment the echo below ⚠️⚠️⚠️
  // But beware: printing every character slows the program down dramatically,
  // because the script can read far faster than the screen can display.
  //echo $ch.PHP_EOL;
}
fclose( $fp );
$end = microtime( true );
echo "cost : ".( $end - $begin ).PHP_EOL;

Running it gives the following result:

Even though we were only given 16 MB of memory, we successfully read through the entire 202 MB file. The running time, however, is pretty dismal. Reading character by character clearly won't cut it, so this time let's read line by line:

<?php
$begin = microtime( true );
$fp = fopen( './test.log', 'r' );
while( false !== ( $buffer = fgets( $fp, 4096 ) ) ){
  //echo $buffer.PHP_EOL;
}
// fgets() returns false on both EOF and error; if we stopped before EOF, something went wrong
if( !feof( $fp ) ){
  throw new Exception('... ...');
}
fclose( $fp );
$end = microtime( true );
echo "cost : ".( $end - $begin ).' sec'.PHP_EOL;

Running it gives the following result:

Line by line is indeed much faster than character by character. But think about it: the system allows us up to 16 MB, so we might as well read a decent-sized chunk on each call and see whether that is faster still:

<?php
$begin = microtime( true );
$fp = fopen( './test.log', 'r' );
while( !feof( $fp ) ){
  // Read 10 KB per call; if you try to echo every chunk here, you are in for a very bad time...
  fread( $fp, 10240 );
}
fclose( $fp );
$end = microtime( true );
echo "cost : ".( $end - $begin ).' sec'.PHP_EOL;
exit;

Save the code and run it. Holy crap!!! Even with memory this tight, we got the time down to about 0.1 seconds!
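If you want to double-check that the chunked version really stays inside the limit, a quick sketch is to print the script's peak memory usage at the end (memory_get_peak_usage() reports it in bytes):

<?php
$fp = fopen( './test.log', 'r' );
while( !feof( $fp ) ){
  fread( $fp, 10240 );
}
fclose( $fp );
// Peak memory actually allocated by the script, printed in MB; it should stay far below 16 MB.
echo round( memory_get_peak_usage( true ) / 1024 / 1024, 2 )." MB".PHP_EOL;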

Now let's upgrade the problem. It is still the 202 MB file from above, but this time the requirement is to read only the last 5 lines. The problem looks a bit tricky: brute-forcing it with fread() and friends would work, but it always feels clumsy. So two new functions have to be introduced: ftell() and fseek(). ftell() reports the current position of the file pointer, and fseek() lets you move that pointer wherever you want. I suggest going to the manual and studying the fseek() entry carefully; to get a quick feel for the two functions first, see the sketch below.
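A tiny sketch of how the two behave together, using the same test.log as before, before we dive into the real code:

<?php
$fp = fopen( './test.log', 'r' );
echo ftell( $fp ).PHP_EOL;      // 0: the pointer starts at the beginning of the file
fseek( $fp, -10, SEEK_END );    // move the pointer to 10 bytes before the end
echo ftell( $fp ).PHP_EOL;      // file size minus 10
echo fgetc( $fp ).PHP_EOL;      // read a single character from that position
fclose( $fp );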

<?php
$fp = fopen( './test1.log', 'r' );
$line = 5;
$pos = -2;
$ch = '';
$content = '';
while( $line > 0 ){
  // Scan backwards from the end of the file, one character at a time,
  // until we hit the newline that precedes the current line.
  while( $ch != "\n" ){
    fseek( $fp, $pos, SEEK_END );
    $ch = fgetc( $fp );
    $pos--;
  }
  $ch = '';
  // The pointer now sits just past that newline, so fgets() returns the whole line.
  // Prepend it, so the collected lines come out in their original order.
  $content = fgets( $fp ) . $content;
  $line--;
}
echo $content;
exit;

The test1.log file contains the following contents:

aa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccc
dddddddddddddddddddddddddddddddd
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
ffffffffffffffffffffffffffffffff
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccc
dddddddddddddddddddddddddddddddd
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
ffffffffffffffffffffffffffffffff
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccc
dddddddddddddddddddddddddddddddd
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
ffffffffffffffffffffffffffffffff
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccc
dddddddddddddddddddddddddddddddd
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
ffffffffffffffffffffffffffffffff
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccc
dddddddddddddddddddddddddddddddd
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
ffffffffffffffffffffffffffffffff
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccc
dddddddddddddddddddddddddddddddd
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
ffffffffffffffffffffffffffffffff
1111111111
2222222222

Save the file and run it. The result is shown in the following figure: