• notice
  • Congratulations on the launch of the Sought Tech site

Randomly sample a certain amount of data from the dataset

Turning to a previously answered question today: Randomly draw a certain amount of data from a list. The previous answer was solved using the Fisher-Yates shuffling algorithm, but after reading the comments, some new ideas.

Let's not talk about the algorithm first, just talk about the idea of random extraction.

Algorithm Evolution of Random Draw

Suppose there are n data stored in a sourcelist (array in JavaScript), you need to randomly extract m (m <= n) data, and put the result in another list resultSince random extraction is a mrepetitive loopsource of insourceresultDescribed in JavaScript is

function  randomSelect ( source , m ) { 
const result = [];
for ( let i = 0 ; i < m ; i ++ ) {
const rIndex = ~ ~ ( Math . random () * source . length );
result . push ( source [ rIndex ]);
source . splice ( rIndex, 1 );
   }
return result ;
}

In most languages, deleting a piece of data from the middle of the list will cause subsequent data rearrangement, which is an inefficient operation. Considering that randomly selecting an event from a set of data is an equal probability event, it has nothing to do with the location of the data, we can remove the selected data, instead of reducing the length of the list, directly move the last data in the list. The last element is not taken into account when the next random location is taken. This improved algorithm:

function  randomSelect ( source , m ) { 
const result = [];
for ( let i = 0 , n = source . length ; i < m ; i ++ , n -- ) {
// ^^^^^^^^ ^^^^^^^^^^ ^^^
const rIndex = ~ ~ ( Math . random () * n );
result . push( source [ rIndex ]);
source [ rIndex ] = source [ n - 1 ];
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^
    }
return result ;
}

Note thatn-- andn - 1 can be merged, after merging:

for ( let  i  =  0 , n  =  source . length ; i  <  m ; i ++ ) { 
...
source [ rIndex ] = source [ -- n ];
}

At this time, I noticed again that the unused space at the sourceback is actually the resultsame size as the space. If you use this part of the space, you will no longer need it result..

function  randomSelect ( source , m ) { 
for ( let i = 0 , n = source . length ; i < m ; i ++ ) {
const rIndex = ~ ~ ( Math . random () * n );
-- n ;
/ / Swap the data of the selected position with the data of the current last position
        [ source [ rIndex ], source [ n ]]= [ source [ n ], source [ rIndex ]];
   }
// Returning the next m data is a randomly selected
return source . slice ( - m );   // -m is equivalent
to source. length - m }

If you keep the originalresult and related algorithms, you will find thatresult and the array elements returned now are in reverse order. But it doesn't matter, because our purpose is to choose randomly, whether revert or not, the result set is random.

But this way, suppose m = source.lengthsourcethe data in the entire is randomly arranged - this is the Fisher -Yates algorithm . Of course, in fact, only  processing is required source.length - 1to achieve the effect of complete shuffling.

Fisher-Yates shuffle algorithm

Fisher-Yates efficient and equal-probability shuffling algorithm. The core idea is to randomly extract a number from 1 to n and exchange it with the last number (n), and then randomly select a number from 1 to n-1 and exchange it with the penultimate number (n-1)… ..., after n - 1 rounds, the data in the original list is completely randomly shuffled.

Each time it is swapped with the "current last element", the processing result is appended. If you instead swap elements with the current position (that is, thei position , you can prepend the result set. But it should be noted that the selection of random numbers is not in [0, n) (0 < n < source.length)this range, but [i, source.length)in this range:

function  randomSelect ( source , m ) { 
for ( let i = 0 , n = source . length ; i < m ; i ++ , n -- ) {
const rIndex = ~ ~ ( Math . random () * n ) + i ;
       [ source [ rIndex ], source [ i]] = [ source [ i ], source [ rIndex ]];
   }
return source . slice ( 0 , m );
}

This process can be understood with a diagram (10 randomly selected)

Randomly sample a certain amount of data from a dataset_algorithm

Since it is a shuffling algorithm, in the case of a small amount of data, a ready-made tool function can be used to shuffle the cards and then extract continuous data of a specified size to achieve the purpose of random extraction. For example, use the _.shuffle() method of Loadsh .

import  _  from  "lodash" ; 
const m = 10 ;
const result = _ . shuffle ( data ). slice ( 0 , m );

There are two problems here

  • If the amount of original data is large, or the amount of original data is quite different from the amount of data to be extracted, a lot of computing power will be wasted

  • The shuffling algorithm modifies the order of elements in the original dataset

For the first question, just use the previously randomSelect()typed , and the second question will be discussed below.

Improve, do not modify the original data set

To not change the original data, that is, do not exchange or shift the elements in the data source. But need to know which data has been selected, what should we do? There are several methods

  • Add a set of selected element sequence numbers, and if the calculatedrIndex can be found in this set, select it again.
    This is a method, but as the ratio of optional sequence numbers to non-selectable sequence numbers gradually decreases, the probability of reselection will greatly increase, and the efficiency of the entire algorithm cannot be guaranteed at all. 

  • Also use a selected sequence number set according to the above method, but when a collision occurs, the sequence number is not reselected, but the sequence number is accumulated and modulo.
    This method is a bit more stable than the previous one, but still suffers from less stable accumulation calculations and may reduce randomness. 


  • ...

Think about it, the purpose of exchanging the last unused element with the element rIndexwhere  is located is to allowrIndex to hit an unfetched value when it reappears - assuming that this value is not from the original data set. fetch, but from an additional dataset?

For example, in rIndex = 5the case of , the first time you getsource[5] to get 6, at this time the last value should be assigned, which is source[5] = source[13] = 14We change this assignment process to map[5] = source[13] = 14; the next time you hit rIndex = 5, first mapcheck if exists map[5]in, if there is, use it, and don’t sourceuse the element in. To describe this general process in code is:

const  n  =  
const value = map [ rIndex ] ?? source [ rIndex ];
result . push ( value );   // can be combined with previous sentence
map [ rIndex ] = map [ n ] ?? source [ n ];

mapThe modified value corresponding to an index sourcetois value in , check first mapIfmap is there, take the one that is in ;mapmapsource

It sounds more abstract, or the picture above

Randomly sample a certain amount of data from the dataset_Random Sort_02

The corresponding code is also easy to write:

function  randomSelect ( source , m ) { 
const result = [];
const map = new Map ();

for ( let i = 0 , n = source . length ; i < m ; i ++ , n -- ) {
const rIndex = ~ ~ ( Math . random ( ) * n ) + i ;
result [ i ] = map . get ( rIndex ) ?? source [ rIndex ];
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^
map . set ( rIndex , map . get ( i ) ?? source [ i ]);
// ^^^^^^^^^^^^^^^^^^^^^^^^^
    }

return result ;
}

Tip: This code can be compared with the last randomeSelectcode .


Tags

Technical otaku

Sought technology together

Related Topic

0 Comments

Leave a Reply

+