
Analyzing the problems encountered when importing large amounts of data into MySQL, and their solutions

In projects we often need to import large amounts of data into the database so that we can analyze it with SQL. During the import, a number of problems come up that have to be solved. Here I use the experience of importing a 4 GB txt file to walk through the problems I ran into and how I solved them, partly as a record for myself and partly as a reference for anyone who runs into the same issues.

The data I imported is a txt file from Baike (a Chinese encyclopedia). The file is a little over 4 GB and contains more than 65 million records, one record per line, with the three fields of each record separated by tabs. The approach is to read each record into a TripleData class that holds the three fields as Strings, collect the records into a List<TripleData>, and then save the list to the MySQL database in batches.
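As an illustration of this reading step, here is a minimal sketch of loading one txt file into a List<TripleData> (the file path parameter and the TripleData constructor are assumptions, not taken from the original project):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public static List<TripleData> readTriples(String path) throws IOException {
    List<TripleData> triples = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {   // one record per line
            String[] fields = line.split("\t");        // three Tab-separated fields
            if (fields.length == 3) {
                triples.add(new TripleData(fields[0], fields[1], fields[2]));
            }
        }
    }
    return triples;
}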

That is the general idea; below are the problems encountered during the actual import.

1 Garbled characters and compatibility issues in the database connection

If the data contains Chinese text, be sure to set the encoding parameters in the database connection URL, in the following form.

URL = "jdbc:mysql://" + IP + ":" + PORT + "/" + DB_NAME + "?useSSL=false&useUnicode=true&characterEncoding=utf-8";

Setting the encoding to UTF-8 solves the garbled-character problem, and setting useSSL solves a compatibility issue between JDBC and MySQL. If useSSL is not set, a warning similar to the following is reported:

Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

The main reason for such messages is that the MySQL server version is relatively new while the JDBC driver version is relatively old, so the compatibility option has to be set explicitly.
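For reference, a minimal connection sketch using such a URL (the host, port, database name and credentials below are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder values; substitute your own host, port, database and credentials
        String url = "jdbc:mysql://127.0.0.1:3306/baike"
                + "?useSSL=false&useUnicode=true&characterEncoding=utf-8";
        try (Connection conn = DriverManager.getConnection(url, "root", "password")) {
            System.out.println("Connected with UTF-8 encoding and useSSL=false");
        }
    }
}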

2 utf8mb4 encoding problem

During the import you may also run into an error message similar to

SQLException: Incorrect string value: '\xF0\xA1\x8B\xBE\xE5\xA2...' for column 'name'

This happens because MySQL's default utf8 character set stores at most 3 bytes per character, which is fine for ordinary text. A large data set, however, will inevitably contain emoji or other special characters that need 4 bytes in UTF-8, which utf8 cannot store, so the insert fails. The solution is utf8mb4, a real 4-byte UTF-8 encoding that MySQL introduced in version 5.5.3; the database encoding has to be changed to it.

You can follow the steps below. First, back up the database to be modified: although utf8mb4 is backward compatible with utf8, it is still worth guarding against a wrong operation. Second, change the character set of the database to utf8mb4 (UTF-8 Unicode) with collation utf8mb4_general_ci; I made this change in Navicat, and the same change can be made from the command line. Third, edit the configuration file my.ini in the root directory of the MySQL installation and add the following settings.

[client]
default-character-set = utf8mb4
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_general_ci
[mysql]
default-character-set = utf8mb4

After the changes are made, MySQL has to be restarted for them to take effect.

After that, the data should import normally.
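As a quick check (a minimal sketch with a placeholder table tb and column name, not taken from the original project), you can try inserting a 4-byte character after the change; if utf8mb4 is not actually in effect, the same "Incorrect string value" error comes back.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public static void checkUtf8mb4() throws SQLException {
    try (Connection conn = DriverManager.getConnection(Common.URL, Common.DB_USERNAME, Common.DB_PASSWORD);
         PreparedStatement psts = conn.prepareStatement("INSERT INTO tb (name) VALUES (?)")) {
        psts.setString(1, "utf8mb4 test \uD83D\uDE00"); // U+1F600, needs 4 bytes in UTF-8
        psts.executeUpdate(); // fails with "Incorrect string value" if utf8mb4 is not set up
    }
}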

3 Time efficiency of the bulk import

Because the data volume is large, we split the data: I divided the 65 million records into 500 files of roughly 110,000 records each, put each batch of 110,000 records into an ArrayList<TripleObject>, and then imported it as one batch. The general idea is to use the "insert into tb (...) values (...), (...) ...;" form, inserting many rows with a single insert statement, which saves a lot of time. The example method is as follows.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public static void insertSQL(String sql, List<TripleObject> tripleObjectList) throws SQLException {
    if (tripleObjectList == null || tripleObjectList.isEmpty()) {
        return; // nothing to insert
    }
    Connection conn = null;
    Statement stmt = null;
    try {
        conn = DriverManager.getConnection(Common.URL, Common.DB_USERNAME, Common.DB_PASSWORD);
        conn.setAutoCommit(false); // commit manually so the whole batch is one transaction
        stmt = conn.createStatement();
        // Build the "(...),(...),..." suffix of the multi-row INSERT
        StringBuilder suffix = new StringBuilder();
        for (TripleObject t : tripleObjectList) {
            // Strip commas, brackets, quotes and backslashes that would break the hand-built SQL
            String s = clean(t.getSubject());
            String p = clean(t.getPredicate());
            String o = clean(t.getObject());
            suffix.append("('").append(s).append("','").append(p).append("','").append(o).append("'),");
        }
        // Build the complete SQL, dropping the trailing comma
        String allsql = sql + suffix.substring(0, suffix.length() - 1);
        stmt.addBatch(allsql); // add the statement to the batch
        stmt.executeBatch();   // execute the batch
        conn.commit();         // commit
    } catch (Exception e) {
        e.printStackTrace();
        if (conn != null) {
            conn.rollback(); // undo the partial batch on failure
        }
    } finally {
        if (stmt != null) {
            stmt.close();
        }
        if (conn != null) {
            conn.close();
        }
    }
}

// Replace or remove the characters that would otherwise break the concatenated SQL string
private static String clean(String value) {
    return value.replaceAll(",", ".")
            .replaceAll("\\(", "")
            .replaceAll("\\)", "")
            .replaceAll("'", "")
            .replaceAll("\\\\", "");
}

The advantage of this method is that the import takes very little time: the 65 million records took almost exactly 1 hour. The disadvantage is that if the data contains long sentences, the commas, brackets, backslashes and so on inside them have to be dealt with first, so you need to weigh whether this method is suitable.

With ordinary inserts, that is, the "insert into tb (...) values (...); insert into tb (...) values (...); ..." form, there is no need to handle special characters, but it takes much longer. In my test it took about 12 minutes to import 110,000 records, which works out to roughly 100 hours for 65 million records.
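A middle ground worth mentioning (my own suggestion, not part of the approach described above) is a parameterized PreparedStatement batch: the driver escapes quotes, commas and backslashes itself, so nothing has to be stripped from the data, and adding the Connector/J URL option rewriteBatchedStatements=true lets the driver rewrite the batch into multi-row inserts on its own. The table and column names below are placeholders. A minimal sketch:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public static void insertWithPreparedBatch(List<TripleObject> tripleObjectList) throws SQLException {
    // Placeholder table and column names; adjust to the real schema
    String sql = "INSERT INTO tb (subject, predicate, object) VALUES (?, ?, ?)";
    try (Connection conn = DriverManager.getConnection(Common.URL, Common.DB_USERNAME, Common.DB_PASSWORD);
         PreparedStatement psts = conn.prepareStatement(sql)) {
        conn.setAutoCommit(false);
        for (TripleObject t : tripleObjectList) {
            psts.setString(1, t.getSubject());   // the driver escapes special characters
            psts.setString(2, t.getPredicate());
            psts.setString(3, t.getObject());
            psts.addBatch();
        }
        psts.executeBatch();
        conn.commit();
    }
}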

We used the first method: the data only needs to be roughly viewable and the requirements on it are not that strict, and it saves a lot of time.

Those are the problems I ran into when importing a large amount of data into MySQL and the solutions I came up with. If you have a better solution, or run into other problems, I would be glad to discuss them.

