Why is Python faster than C++ in this case?

2024/9/23 5:29:49

A program in both Python and C++ is given below, which performs the following task: read white-space delimited words from stdin, print the unique words sorted by string length along with a count of each unique word to stdout. The format for a line of the output is: length, count, word.

For exmaple, with this input file (488kB of a thesaurus) http://pastebin.com/raw.php?i=NeUBQ22T

The output, with formatting, is this:

1 57 "
1 1 n
1 1 )
1 3 *
1 18 ,
1 7 -
1 1 R
1 13 .
1 2 1
1 1 S
1 5 2
1 1 3
1 2 4
1 2 &
1 91 %
1 1 5
1 1 6
1 1 7
1 1 8
1 2 9
1 16 ;
1 2 =
1 5 A
1 1 C
1 5 e
1 3 E
1 1 G
1 11 I
1 1 L
1 4 N
1 681 a
1 2 y
1 1 P
2 1 67
2 1 y;
2 1 P-
2 85 no
2 9 ne
2 779 of
2 1 n;
...

Here is the program in C++

#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>bool compare_strlen (const std::string &lhs, const std::string &rhs) {return (lhs.length() < rhs.length());
}int main (int argc, char *argv[]) {std::string str;std::vector<std::string> words;/* Extract words from the input file, splitting on whitespace */while (std::cin >> str) {words.push_back(str);}/* Extract unique words and count the number of occurances of each word */std::set<std::string> unique_words;std::map<std::string,int> word_count; for (std::vector<std::string>::iterator it = words.begin();it != words.end(); ++it) {unique_words.insert(*it);word_count[*it]++;}words.clear();std::copy(unique_words.begin(), unique_words.end(),std::back_inserter(words));// Sort by word length std::sort(words.begin(), words.end(), compare_strlen);// Print words with length and number of occurancesfor (std::vector<std::string>::iterator it = words.begin();it != words.end(); ++it) {std::cout << it->length() << " " << word_count[*it]  << " " <<*it << std::endl;}return 0;
}

Here is the Program in Python:

import fileinput
from collections import defaultdictwords = set()
count = {}
for line in fileinput.input():line_words = line.split()for word in line_words:if word not in words:words.add(word)count[word] = 1else:count[word] += 1 words = list(words)
words.sort(key=len)for word in words:print len(word), count[word], word

For the C++ program, the compiler used was g++ 4.9.0 with the -O3 flag.

The version of Python used was 2.7.3

Time taken for the C++ program:

time ./main > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txtreal    0m0.687s
user    0m0.559s
sys     0m0.123s

Time taken for the Python program:

time python main.py > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txtreal    0m0.369s
user    0m0.308s
sys     0m0.029s

The Python program is much faster than the C++ program, and is relatively even faster with larger input sizes. What's going on here? Is my use of the C++ STL incorrect?

Edit: As suggested by a comment and an answer, I changed the C++ program to use std::unordered_set and std::unordered_map.

The following lines were changed

#include <unordered_set>
#include <unordered_map>...std::unordered_set<std::string> unique_words;
std::unordered_map<std::string,int> word_count;

The compilation command:

g++-4.9 -std=c++11 -O3 -o main main.cpp

This improved performance only marginally:

time ./main > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txtreal    0m0.604s
user    0m0.479s
sys     0m0.122s

Edit2: A much faster program in C++

This is a combination of NetVipeC's solution, Dieter Lücking's solution, and the top answer to this question. The real performance killer was cin using an unbuffered read by default. Solved with std::cin.sync_with_stdio(false);. This solution also uses a single container, taking advantage of the ordered map in C++.

#include <vector>
#include <string>
#include <iostream>
#include <set>
#include <map>struct comparer_strlen {bool operator()(const std::string& lhs, const std::string& rhs) const {if (lhs.length() == rhs.length())return lhs < rhs;return lhs.length() < rhs.length();}
};int main(int argc, char* argv[]) {std::cin.sync_with_stdio(false);std::string str;typedef std::map<std::string, int, comparer_strlen> word_count_t;/* Extract words from the input file, splitting on whitespace *//* Extract unique words and count the number of occurances of each word */word_count_t word_count;while (std::cin >> str) {word_count[str]++;}// Print words with length and number of occurancesfor (word_count_t::iterator it = word_count.begin();it != word_count.end(); ++it) {std::cout << it->first.length() << " " << it->second << " "<< it->first << '\n';}return 0;
}

Runtime

time ./main3 > measure-and-count.txt < ~/Documents/thesaurus/thesaurus.txtreal    0m0.106s
user    0m0.091s
sys     0m0.012s

Edit3: A nice and concise version of the Python program was provided by Daniel, it runs in about the same time as the version above:

import fileinput
from collections import Countercount = Counter(w for line in fileinput.input() for w in line.split())for word in sorted(count, key=len):print len(word), count[word], word

Runtime:

time python main2.py > measure-and-count.txt.py < ~/Documents/thesaurus/thesaurus.txtreal    0m0.342s
user    0m0.312s
sys     0m0.027s
Answer

Test with this, it must be faster than the original C++.

Changes are:

  • Eliminated the vector words to save the words (there will be saved already in word_count).
  • Eliminated the set unique_words (in word_count are only the unique words).
  • Eliminated the second copy to words, not needed.
  • Eliminated the sort of the words (the order was updated in the map, now the words in the map are order by length and the words with same length are lexicographically order.

    #include <vector>
    #include <string>
    #include <iostream>
    #include <set>
    #include <map>struct comparer_strlen_functor {operator()(const std::string& lhs, const std::string& rhs) const {if (lhs.length() == rhs.length())return lhs < rhs;return lhs.length() < rhs.length();}
    };int main(int argc, char* argv[]) {std::cin.sync_with_stdio(false);std::string str;typedef std::map<std::string, int, comparer_strlen_functor> word_count_t;/* Extract words from the input file, splitting on whitespace *//* Extract unique words and count the number of occurances of each word */word_count_t word_count;while (std::cin >> str) {word_count[str]++;}// Print words with length and number of occurancesfor (word_count_t::iterator it = word_count.begin(); it != word_count.end();++it) {std::cout << it->first.length() << " " << it->second << " " << it->first<< "\n";}return 0;
    }
    

New version of the reading loop, to read line by line and split. Need #include <boost/algorithm/string/split.hpp>

while (std::getline(std::cin, str)) {for (string_split_iterator It = boost::make_split_iterator(str, boost::first_finder(" ", boost::is_iequal()));It != string_split_iterator(); ++It) {if (It->end() - It->begin() != 0)word_count[boost::copy_range<std::string>(*It)]++;}
}

Testing in Core i5, 8GB RAM, GCC 4.9.0, 32bits, run in 238ms. Updated the code with std::cin.sync_with_stdio(false); and \n as proposed.

https://en.xdnf.cn/q/71859.html

Related Q&A

Python - write headers to csv

Currently i am writing query in python which export data from oracle dbo to .csv file. I am not sure how to write headers within file. try:connection = cx_Oracle.connect(user,pass,tns_name)cursor = con…

Opening/Attempting to Read a file [duplicate]

This question already has answers here:PyCharm shows unresolved references error for valid code(31 answers)Closed 5 years ago.I tried to simply read and store the contents of a text file into an array,…

How to pass custom settings through CrawlerProcess in scrapy?

I have two CrawlerProcesses, each is calling different spider. I want to pass custom settings to one of these processes to save the output of the spider to csv, I thought I could do this:storage_setti…

numpy how to slice index an array using arrays?

Perhaps this has been raised and addressed somewhere else but I havent found it. Suppose we have a numpy array: a = np.arange(100).reshape(10,10) b = np.zeros(a.shape) start = np.array([1,4,7]) # ca…

How to import _ssl in python 2.7.6?

My http server is based on BaseHTTPServer with Python 2.7.6. Now I want it to support ssl transportation, so called https.I have installed pyOpenSSL and recompiled python source code with ssl support. …

Unexpected Indent error in Python [duplicate]

This question already has answers here:Im getting an IndentationError (or a TabError). How do I fix it?(6 answers)Closed 4 years ago.I have a simple piece of code that Im not understanding where my er…

pyshark can not capture the packet on windows 7 (python)

I want to capture the packet using pyshark. but I could not capture the packet on windows 7.this is my python codeimport pyshark def NetCap():print capturing...livecapture = pyshark.LiveCapture(interf…

How to get the Signal-to-Noise-Ratio from an image in Python?

I am filtering an image and I would like to know the SNR. I tried with the scipy function scipy.stats.signaltonoise() but I get an array of numbers and I dont really know what I am getting.Is there an…

Python and OpenCV - Cannot write readable avi video files

I have a code like this:import numpy as np import cv2cap = cv2.VideoCapture(C:/Users/Hilman/haatsu/drive_recorder/sample/3.mov)# Define the codec and create VideoWriter object fourcc = cv2.VideoWriter_…

Python as FastCGI under windows and apache

I need to run a simple request/response python module under an existing system with windows/apache/FastCGI.All the FastCGI wrappers for python I tried work for Linux only (they use socket.fromfd() and …