GitHub engineer Timothy Clem has described in detail the technology behind GitHub's source code search engine, called Blackbird. The search engine is written in Rust and so far covers almost 45 million GitHub repositories, which together amount to 115 TB of code and 15.5 billion documents. GitHub Code Search is currently in beta testing.
To search through that many lines of code, something more powerful is needed than grep, the standard command-line tool for searching text data on Unix-like systems. Clem explained that running an exhaustive regular-expression query with ripgrep over a 13 GB file held in memory on an 8-core Intel processor takes about 2.769 seconds, or roughly 0.6 GB/s per core.
“This just isn't going to work for the amount of data we have,” he said. “Code search runs on 64-core, 32-machine clusters. Even if we managed to fit 115 TB of code in memory and assume we can perfectly parallelize the work, we're going to saturate 2,048 CPU cores for 96 seconds to serve a single request! Only that one request can complete. Everyone else has to wait in line,” Clem explained.
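The arithmetic behind that estimate is easy to check. Here is a minimal sketch in Rust (the language Blackbird itself is written in) that reproduces it from the figures quoted above; the talk rounds the per-core rate to 0.6 GB/s, which gives the 96-second figure:

```rust
// Back-of-envelope check of the numbers above, using only figures from the talk.
fn main() {
    let per_core_gbps = 13.0 / 2.769 / 8.0;   // ripgrep: 13 GB in 2.769 s on 8 cores
    let cores = 64.0 * 32.0;                   // 64-core machines x 32-machine cluster
    let corpus_gb = 115.0 * 1024.0;            // 115 TB of code, in GB
    let seconds_per_query = corpus_gb / (per_core_gbps * cores);
    println!("per-core throughput: {:.2} GB/s", per_core_gbps);
    println!("one exhaustive query: ~{:.0} s", seconds_per_query);
    println!("max sustained rate: ~{:.3} queries/s", 1.0 / seconds_per_query);
}
```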
At 0.01 requests per second, grep was simply not an option. So GitHub shifted most of the work to pre-computed search indices. These are essentially key-value maps. This approach cuts the computational cost of searching for document characteristics, such as the programming language or a sequence of words, by using a numeric key rather than a text string.
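The article does not give Blackbird's actual index layout, but the key-value idea can be sketched roughly as follows: a numeric key (say, a hashed n-gram or a language identifier) maps to a posting list of document IDs. All names and types here are illustrative, not Blackbird's real ones.

```rust
use std::collections::HashMap;

/// Illustrative posting-list index: a numeric key (for example a hashed trigram
/// or a language ID) maps to a list of IDs of documents that have that feature.
struct Index {
    postings: HashMap<u64, Vec<u32>>,
}

impl Index {
    fn new() -> Self {
        Index { postings: HashMap::new() }
    }

    /// Record that `doc_id` has the feature identified by `key`.
    /// Documents are assumed to be ingested in ascending ID order,
    /// which keeps every posting list sorted for free.
    fn add(&mut self, key: u64, doc_id: u32) {
        self.postings.entry(key).or_insert_with(Vec::new).push(doc_id);
    }

    /// Look a feature up by its numeric key rather than by a text string.
    fn lookup(&self, key: u64) -> &[u32] {
        self.postings.get(&key).map(Vec::as_slice).unwrap_or(&[])
    }
}

fn main() {
    let mut idx = Index::new();
    idx.add(42, 7);   // document 7 contains feature 42 (say, the trigram "fn ")
    idx.add(42, 19);
    println!("docs for key 42: {:?}", idx.lookup(42));
}
```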
Even so, these indices are too large to fit in memory, so GitHub built iterators over each index it needs to access. They return sorted document IDs that satisfy the query criteria.
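Because each iterator yields document IDs in sorted order, combining criteria (for example, a term AND a language filter) reduces to a cheap merge of two sorted streams. A rough sketch of that intersection step, again with hypothetical names rather than Blackbird's own:

```rust
/// Intersect two sorted streams of document IDs (a logical AND of two query terms).
/// Both iterators are assumed to yield IDs in ascending order, as the index guarantees.
fn intersect(
    mut a: impl Iterator<Item = u32>,
    mut b: impl Iterator<Item = u32>,
) -> Vec<u32> {
    let mut out = Vec::new();
    let (mut x, mut y) = (a.next(), b.next());
    while let (Some(da), Some(db)) = (x, y) {
        if da == db {
            out.push(da);
            x = a.next();
            y = b.next();
        } else if da < db {
            x = a.next();
        } else {
            y = b.next();
        }
    }
    out
}

fn main() {
    // Say one iterator yields documents containing a term, the other documents in Rust.
    let term_docs = vec![2u32, 5, 9, 14, 20].into_iter();
    let lang_docs = vec![1u32, 5, 14, 33].into_iter();
    assert_eq!(intersect(term_docs, lang_docs), vec![5, 14]);
}
```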
To keep the search index manageable, GitHub relies on sharding (splitting the data into pieces using Git's content-addressable hashing scheme) and on delta encoding (storing only the differences, or deltas, between data) to reduce the amount of data and metadata that must be scanned. This works well because GitHub has a great deal of redundant data (branches, for example): the 115 TB of data can be shrunk to 25 TB with deduplication techniques.
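A rough illustration of both ideas, assuming a toy hash function and shard count rather than what GitHub actually uses: hashing a blob's content yields both its identity and its shard, and identical blobs on different branches hash to the same key, so they are stored and indexed only once.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

const NUM_SHARDS: u64 = 32; // placeholder: one shard per machine in this sketch

/// Content addressing: the key is derived from the bytes themselves.
/// Git uses the blob's object ID; a standard-library hasher stands in for it here.
fn content_key(blob: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    blob.hash(&mut h);
    h.finish()
}

/// The content key also decides which shard owns the blob.
fn shard_for(key: u64) -> u64 {
    key % NUM_SHARDS
}

fn main() {
    let mut seen: HashSet<u64> = HashSet::new();
    // The same file on two branches produces the same key and is indexed only once,
    // which is what shrinks the 115 TB corpus toward 25 TB.
    let blobs = [&b"fn main() {}"[..], &b"fn main() {}"[..], &b"struct Foo;"[..]];
    for blob in blobs {
        let key = content_key(blob);
        if seen.insert(key) {
            println!("index blob {:016x} on shard {}", key, shard_for(key));
        } else {
            println!("duplicate blob {:016x}, skipped", key);
        }
    }
}
```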
The resulting system is far faster than grep: 640 requests per second versus 0.01 requests per second. Indexing runs at about 120,000 documents per second, so processing 15.5 billion documents takes roughly 36 hours, or 18 hours for re-indexing, since delta (incremental) indexing reduces the number of documents that have to be scanned.
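Those ingest figures are straightforward to verify from the quoted rate:

```rust
// Sanity check of the indexing numbers quoted above.
fn main() {
    let docs = 15.5e9_f64;            // 15.5 billion documents
    let rate = 120_000.0;             // documents indexed per second
    let hours = docs / rate / 3600.0;
    println!("full index: about {:.0} hours", hours); // ~36 hours
}
```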