When we deal with large amounts of data that are all going into the same database, running single instances of individual INSERT commands can take a ridiculous amount of time and waste a considerable amount of I/O. What we need is a way to insert multiple items in one go.
In this article by Albert Lukaszewski, PhD, author of MySQL for Python, we will look at the following:
- How iteration can help us execute several individual INSERT statements rapidly
- Using executemany() to submit several INSERT statements in one go
- When not to use executemany()
- Throttling how much data is inserted at a time
You need to collate and rearrange the contents of several databases into one table whenever a given indicator is triggered (the indicator may be, among other things, a stock price, a value in one of the databases, or the flight pattern of African swallows). The format and length of the tables are predictable. There are 5,000 records in each table, so manually running a separate INSERT statement for each record is not a viable option. The problem calls for a means of iterating through a series and forming each INSERT statement automatically. We could pursue one of the following two ways to do this:
- Write a MySQL script to insert the data in batch mode
- Iterate over the data to form and execute a MySQL INSERT statement accordingly
Neither of these is a very good solution to the present problem.
Why not a MySQL script?
As we saw when we created the world and Sakila databases, a MySQL script can contain the schema of a database, the values of the database tables, or both. To create data quickly, there is nothing better. However, there are several drawbacks to using a script in this scenario:
- Lack of automation
- Debugging the process
- Inefficient I/O
Lack of automation
Barring the use of an automation daemon (for example, cron) to run a cron job or a similar scheduled task, a DBA or their designate would have to run the script. This unnecessarily consumes time. It is comparable to swapping tape backups when automated backup services are available and proven.
Most modern computing systems support automatic task scheduling. On Unix-based systems like Linux and Mac OS X, one can schedule processes to be run at set times and on a regular rotation. One of the most widely used programs for such scheduling is cron. A single scheduled task in cron has thus come to be known as a cron job.
Debugging the process
Creating a MySQL script can be a very tedious task, and the slightest error or oversight can ruin the entire process. Using the --force flag causes MySQL to ignore errors in the source file. It is therefore not something that should be used regularly if one values the integrity of data. If the script is malformed for any reason, a two-minute data insertion job can quickly become a two-hour (at least!), unscheduled debugging process.
Inefficient I/O
Dumping large amounts of data on MySQL can create latency across the network. If the script were run by a DBA or similar, that person should rightly evaluate the state of the network before running it. Regardless of experience and training, judgment calls naturally require estimation and can result in ambiguous decision-making. While this is unavoidable, it should be minimized where possible.
If the server is experiencing high traffic, the DBA would need to find something else to do and reschedule the running of the script. This randomly postpones the time of execution and the availability of the results. Also, it runs the risk of the DBA forgetting to execute the script.
If the script is automated with a cron job or similar, we risk dumping a lot of data onto MySQL at a time when others are doing more time sensitive tasks. On most servers, we can background the process, so it does not adversely impact the client processes. This, however, only ensures that the process will be run. It does not guarantee that the process will be finished by a particular time.
Why not iterate?
Every time a program iterates to read or insert data, it incurs a certain amount of I/O processing. How the program is written determines how much I/O overhead each loop carries.
A test sample: generating primes
To illustrate this, consider a program that accepts a series of numbers, generates the prime numbers that are equal to or less than each of those numbers and inserts those numbers into a database called primes with a table of the same name. The table has the following description:
mysql> describe primes;
+--------+--------------+------+-----+---------+----------------+
| Field  | Type         | Null | Key | Default | Extra          |
+--------+--------------+------+-----+---------+----------------+
| ID     | int(11)      | NO   | PRI | NULL    | auto_increment |
| NUMBER | int(11)      | NO   |     | 0       |                |
| PRIMES | varchar(300) | NO   |     | 0       |                |
+--------+--------------+------+-----+---------+----------------+
and can be created with the following statement:
CREATE TABLE `primes` (
    `ID` int(11) NOT NULL auto_increment,
    `NUMBER` int(11) NOT NULL default '0',
    `PRIMES` varchar(300) NOT NULL default '0',
    PRIMARY KEY (`ID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Using the sys module to handle user input, the string module to split the series of entries, and MySQLdb to handle the database connection, we get the following preamble:
import MySQLdb, string, sys
Next, we need a function that will generate the prime numbers up to a certain limit, n:
def primes(n):
    """Returns a list of prime numbers up to n using an algorithm
    based on the Sieve of Eratosthenes."""
    if n < 2:
        return 'A number less than two is not prime by definition.'
    s = range(3, n+1, 2)
    maxfactor = n ** 0.5
    half = (n+1)/2 - 1
    i = 0
    m = 3
    while m <= maxfactor:
        if s[i]:
            j = (m*m - 3)/2
            s[j] = 0
            while j < half:
                s[j] = 0
                j += m
        i = i + 1
        m = 2*i + 3
    return str([2] + [x for x in s if x])
This algorithm is based on the Sieve of Eratosthenes, one of the simplest ways of generating prime numbers. It uses the following steps:
- Generate a list of integers from 2 to n.
- Discount the multiples of each remaining number whose square is less than or equal to n; this leaves only the prime numbers up to n.
- Stop when the square of the number in the series is greater than n.
Prime numbers by definition have no factors other than themselves and one. The lowest prime number is therefore 2. For this reason, we check whether n is less than 2 and return a message accordingly.
For this program, we want a string returned so we convert the results before we return them.
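The steps above can also be sketched independently of the database code. The following is a minimal Python 3 illustration (not the book's version, which returns a string for the varchar column; this one returns a list for clarity):

```python
def sieve(n):
    """Return a list of all primes <= n, via the Sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)        # index i records whether i is prime
    is_prime[0] = is_prime[1] = False
    m = 2
    while m * m <= n:                  # step 3: stop once m*m exceeds n
        if is_prime[m]:
            # step 2: discount every multiple of m, starting at m*m
            for j in range(m * m, n + 1, m):
                is_prime[j] = False
        m += 1
    return [i for i, flag in enumerate(is_prime) if flag]

print(sieve(30))   # → [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Wrapping the result in str() would reproduce the string form the program stores in the PRIMES column.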
For more on the Sieve of Eratosthenes and how it works, see the entry on Wikipedia. The previous algorithm can be found in many forms and in many languages on the Internet; two of the best examples informed this discussion.
Next, we create a function to form and execute the INSERT statement.
def insert(n, p, cur):
    statement = """INSERT INTO primes(number, primes) VALUES("%s",
"%s")""" %(n, p)
    cur.execute(statement)
    return statement
This function takes the number, the primes, and the cursor object handle as arguments.
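Note that the statement here is built with Python string interpolation; DB-API cursors can instead bind the values through placeholders, which leaves quoting and escaping to the driver. A minimal sketch of that alternative, using Python's built-in sqlite3 module as a stand-in for MySQLdb (sqlite3 uses ? placeholders where MySQLdb uses %s):

```python
import sqlite3

# In-memory database standing in for the MySQL primes database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE primes (id INTEGER PRIMARY KEY, number INTEGER, primes TEXT)")

def insert(n, p, cur):
    # The driver substitutes and escapes the bound values for us.
    cur.execute("INSERT INTO primes(number, primes) VALUES (?, ?)", (n, p))

insert(7, "[2, 3, 5, 7]", cur)
cur.execute("SELECT number, primes FROM primes")
print(cur.fetchall())   # → [(7, '[2, 3, 5, 7]')]
```

The same pattern works with MySQLdb by swapping the connect() call and the placeholder style.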
Finally, our main() function looks like this:
def main():
    numbers = sys.argv[1]
    iterations = numbers.split(',')
    for n in iterations:
        mydb = MySQLdb.connect(host = 'localhost',
                               user = 'skipper',
                               passwd = 'secret',
                               db = 'primes')
        cur = mydb.cursor()
        n = int(n)
        p = primes(n)
        statement = insert(n, p, cur)
        print "Data entered with the following statement:\n", statement
We split the values passed by the user and iterate through them with a for loop. Note that we include the database connection and cursor object creation as part of the iteration.
We then include the usual if clause to call main():
if __name__ == '__main__':
    main()
Comparing execution speeds
We can test the speed of this program by giving it a series of simple primes to generate, even repetitions of the same number. Here we call it with a series of sevens to process, using the Unix command time to measure its speed.
time ./primes.py "7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7"
Given this series of sevens, we get a real execution time of 0.175 seconds.
If we rearrange the program by moving the mydb and cur assignment lines to follow immediately after the preamble (remembering to adjust the indentation), we get an execution speed of 0.138 seconds. This difference in speed (0.037 seconds) is not particularly significant on a local system with a single user, but it would be compounded many times over on a server with hundreds or thousands of users.
So we see that, if the connection and cursor are created on every pass, the program is less efficient than if the loop contained only the execution of the statement and the connection and cursor persisted for the life of the program. However, even passing a series of single statements to MySQL through Cursor.execute(), one at a time, incurs more protocol overhead than necessary. Excessive iteration consumes resources needlessly.
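The more efficient shape previewed at the start of the article can be sketched as follows: create the connection and cursor once, build every parameter tuple first, then hand the whole batch to executemany() in a single call. Python's built-in sqlite3 module is used here as a stand-in for MySQLdb, and the trial-division primes() helper is an illustrative simplification of the sieve above (sqlite3 uses ? placeholders where MySQLdb uses %s):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # one connection for the whole run
cur = conn.cursor()
cur.execute(
    "CREATE TABLE primes (id INTEGER PRIMARY KEY, number INTEGER, primes TEXT)")

def primes(n):
    """Primes <= n, returned as a string for the text column."""
    found = [p for p in range(2, n + 1)
             if all(p % d for d in range(2, int(p ** 0.5) + 1))]
    return str(found)

numbers = [7, 7, 11]
rows = [(n, primes(n)) for n in numbers]   # build all parameter tuples first

# One executemany() call replaces len(rows) separate execute() round trips.
cur.executemany("INSERT INTO primes(number, primes) VALUES (?, ?)", rows)
conn.commit()

cur.execute("SELECT COUNT(*) FROM primes")
print(cur.fetchone()[0])   # → 3
```

The loop no longer touches the database at all; only the single batched call does, which is exactly the saving that executemany() offers over per-row execute() calls.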