How does PHP find the same records in two large files?
Introduction
Given two files a and b containing x and y rows of data respectively (x and y both greater than 1 billion), and a machine with a 100 MB memory limit, how do we find the records common to both files?
Ideas
The main difficulty is that this volume of data cannot be read into memory all at once.
If it cannot be read in one pass, can it be read in several? And if so, how do we match up records across those separate reads?
We can use divide and conquer to turn one big problem into many small ones. Identical strings always produce identical hash values, so we can hash each record and take the result modulo n to scatter the records into n bucket files. How large should n be? With 100 MB of memory, a PHP array can hold roughly 1 million (100w) entries, so with 1 billion rows in each of a and b (2 billion rows in total), n must be at least 2,000 for each bucket to fit in memory.
We then have 2,000 bucket files. Any record common to a and b must land in the same bucket file, and each bucket file can be read into memory. We can look for duplicate records in each of the 2,000 files in turn and append the matches to a single output file. That output file holds exactly the records common to a and b.
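The key property behind the bucketing step can be checked directly: hashing is deterministic, so two copies of the same record, one read from a and one from b, always map to the same bucket index. A minimal sketch (`bucket()` is a hypothetical helper that mirrors the hash-modulo logic used later):

```php
<?php
// bucket() maps a record to a bucket index in [0, $mod - 1].
// Because hashing is deterministic, the same record always gets the
// same index, no matter which file it was read from.
function bucket(string $line, int $mod = 2000): int
{
    return crc32(hash('md5', $line)) % $mod;
}

$fromA = "123456789" . PHP_EOL; // a record read from file a
$fromB = "123456789" . PHP_EOL; // the same record read from file b
var_dump(bucket($fromA) === bucket($fromB)); // bool(true)
```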
Finding duplicate records within one small file is easy: use each record as a hash-table key and collect the keys whose count is >= 2. (This assumes each record appears at most once within a single file; if a file can contain internal duplicates, tag each record with its source file and only report records seen in both.)
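The counting step can be sketched with PHP's built-in array_count_values, which performs exactly this key-based tally (the sample data below is invented for illustration):

```php
<?php
// Each distinct line becomes an array key; its value is the number of
// times that line occurs. A count >= 2 means the record was contributed
// by both files (assuming no duplicates within a single file).
$bucket = ["7\n", "13\n", "7\n", "99\n"]; // a merged bucket: rows from a and b
$counts = array_count_values($bucket);
$common = array_keys(array_filter($counts, fn ($n) => $n >= 2));
print_r($common); // the only record appearing twice: "7\n"
```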
Practice
Files with a billion rows are too large to generate in practice, and working with them would be a waste of time for a demonstration, so we shrink the problem.
The reduced problem: a 1 MB memory limit, with 100,000 (10w) rows in each of a and b. The memory limit can be enforced with PHP's ini_set('memory_limit', '1M');.
Generate test files
Generate random numbers to fill the file:
```php
/**
 * Generate random numbers to fill a file.
 * Author: ClassmateLin
 * Email: classmatelin.sit[email protected]
 * Site: https://classmatelin-1258942535.cos-website.ap-hongkong.myqcloud.com
 * @param string $filename  output file name
 * @param int    $batch     number of batches to generate
 * @param int    $batchSize number of rows per batch
 */
function generate(string $filename, int $batch = 1000, int $batchSize = 10000)
{
    for ($i = 0; $i < $batch; $i++) {
        $str = '';
        for ($j = 0; $j < $batchSize; $j++) {
            $str .= rand($batch, $batchSize) . PHP_EOL; // one random number per row
        }
        file_put_contents($filename, $str, FILE_APPEND); // append to the file
    }
}

generate('a.txt', 10);
generate('b.txt', 10);
```
Split file
Split a.txt and b.txt into n bucket files by hash modulo:
```php
/**
 * Scatter a file's records into n bucket files by hash modulo.
 * Author: ClassmateLin
 * @param string $filename input file name
 * @param int    $mod      number of bucket files (the modulus)
 * @param string $dir      output directory
 */
function splitFile(string $filename, int $mod = 20, string $dir = 'files')
{
    if (!is_dir($dir)) {
        mkdir($dir);
    }
    $fp = fopen($filename, 'r');
    while (!feof($fp)) {
        $line = fgets($fp);
        if ($line === false) { // fgets() returns false at end of file
            continue;
        }
        $n = crc32(hash('md5', $line)) % $mod;            // hash, then take the modulus
        $filepath = $dir . '/' . $n . '.txt';             // bucket file path
        file_put_contents($filepath, $line, FILE_APPEND); // append to the bucket file
    }
    fclose($fp);
}

splitFile('a.txt');
splitFile('b.txt');
```
Running the splitFile function produces 20 bucket files in the files directory.
Find duplicate records
Now we need to find the common records across the 20 bucket files. Since records from a and b that hash to the same value end up in the same bucket, this reduces to finding duplicate records within a single file, repeated 20 times.
Find the same record in a file:
```php
/**
 * Find the records that appear at least twice in a file and append them
 * to the output file.
 * Author: ClassmateLin
 * @param string $inputFilename  input file path
 * @param string $outputFilename output file path
 */
function search(string $inputFilename, $outputFilename = 'output.txt')
{
    $table = [];
    $fp = fopen($inputFilename, 'r');
    while (!feof($fp)) {
        $line = fgets($fp);
        if ($line === false) { // fgets() returns false at end of file
            continue;
        }
        !isset($table[$line]) ? $table[$line] = 1 : $table[$line]++; // first occurrence: set to 1, otherwise increment
    }
    fclose($fp);
    foreach ($table as $line => $count) {
        if ($count >= 2) { // a record that appears at least twice is a common record
            file_put_contents($outputFilename, $line, FILE_APPEND);
        }
    }
}
```
Find the same records in all files:
```php
/**
 * Find the common records in every file under the given directory and
 * append them to the output file.
 * Author: ClassmateLin
 * @param string $dirs           input directory
 * @param string $outputFilename output file path
 */
function searchAll($dirs = 'files', $outputFilename = 'output.txt')
{
    $files = scandir($dirs);
    foreach ($files as $file) {
        $filepath = $dirs . '/' . $file;
        if (is_file($filepath)) { // skip the "." and ".." entries
            search($filepath, $outputFilename);
        }
    }
}
```
So far we have solved the space problem of processing large files. What about the time problem? On a single machine, the bucket files can be processed in parallel across CPU cores; if that is still not enough, the work can be spread over multiple servers.
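As a sketch of the single-machine, multi-core idea: the pcntl extension (CLI only) can fork one worker process per bucket file. searchParallel and its $worker callback are illustrative names, not part of the code above; each child writes to its own output file so that concurrent writes never interleave.

```php
<?php
// Fork one child per bucket file; each child processes a single bucket
// and writes to its own output file, then the parent waits for all of
// them. Requires the pcntl extension (available in the CLI SAPI).
function searchParallel(callable $worker, int $mod = 20, string $dir = 'files'): void
{
    $pids = [];
    for ($i = 0; $i < $mod; $i++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            exit("fork failed" . PHP_EOL);
        }
        if ($pid === 0) {                            // child process
            $worker("$dir/$i.txt", "output.$i.txt"); // handle one bucket
            exit(0);
        }
        $pids[] = $pid;                              // parent: remember the child
    }
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);                // wait for every child
    }
}

// Usage sketch: run search() (defined earlier) on all 20 buckets at once:
// searchParallel('search');
```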
Complete code
```php
<?php

ini_set('memory_limit', '1M'); // 1 MB memory limit

/**
 * Generate random numbers to fill a file.
 * Author: ClassmateLin
 * @param string $filename  output file name
 * @param int    $batch     number of batches to generate
 * @param int    $batchSize number of rows per batch
 */
function generate(string $filename, int $batch = 1000, int $batchSize = 10000)
{
    for ($i = 0; $i < $batch; $i++) {
        $str = '';
        for ($j = 0; $j < $batchSize; $j++) {
            $str .= rand($batch, $batchSize) . PHP_EOL; // one random number per row
        }
        file_put_contents($filename, $str, FILE_APPEND); // append to the file
    }
}

/**
 * Scatter a file's records into n bucket files by hash modulo.
 * Author: ClassmateLin
 * @param string $filename input file name
 * @param int    $mod      number of bucket files (the modulus)
 * @param string $dir      output directory
 */
function splitFile(string $filename, int $mod = 20, string $dir = 'files')
{
    if (!is_dir($dir)) {
        mkdir($dir);
    }
    $fp = fopen($filename, 'r');
    while (!feof($fp)) {
        $line = fgets($fp);
        if ($line === false) { // fgets() returns false at end of file
            continue;
        }
        $n = crc32(hash('md5', $line)) % $mod;            // hash, then take the modulus
        $filepath = $dir . '/' . $n . '.txt';             // bucket file path
        file_put_contents($filepath, $line, FILE_APPEND); // append to the bucket file
    }
    fclose($fp);
}

/**
 * Find the records that appear at least twice in a file and append them
 * to the output file.
 * Author: ClassmateLin
 * @param string $inputFilename  input file path
 * @param string $outputFilename output file path
 */
function search(string $inputFilename, $outputFilename = 'output.txt')
{
    $table = [];
    $fp = fopen($inputFilename, 'r');
    while (!feof($fp)) {
        $line = fgets($fp);
        if ($line === false) {
            continue;
        }
        !isset($table[$line]) ? $table[$line] = 1 : $table[$line]++; // first occurrence: set to 1, otherwise increment
    }
    fclose($fp);
    foreach ($table as $line => $count) {
        if ($count >= 2) { // a record that appears at least twice is a common record
            file_put_contents($outputFilename, $line, FILE_APPEND);
        }
    }
}

/**
 * Find the common records in every file under the given directory and
 * append them to the output file.
 * Author: ClassmateLin
 * @param string $dirs           input directory
 * @param string $outputFilename output file path
 */
function searchAll($dirs = 'files', $outputFilename = 'output.txt')
{
    $files = scandir($dirs);
    foreach ($files as $file) {
        $filepath = $dirs . '/' . $file;
        if (is_file($filepath)) { // skip the "." and ".." entries
            search($filepath, $outputFilename);
        }
    }
}

// Generate the test files
generate('a.txt', 10);
generate('b.txt', 10);

// Split into bucket files
splitFile('a.txt');
splitFile('b.txt');

// Find the common records
searchAll('files', 'output.txt');
```