GP-4009 Introduced BSim functionality including support for postgresql, elasticsearch and h2 databases. Added BSim correlator to Version Tracking.
caheckman 2023-11-17 01:13:42 +00:00 committed by ghidra1
parent f0f5b8f2a4
commit 0865a3dfb0
509 changed files with 77125 additions and 934 deletions

@ -1,8 +1,6 @@
##VERSION: 2.0
##MODULE IP: Apache License 2.0
##MODULE IP: Apache License 2.0 with LLVM Exceptions
.classpath||NONE||reviewed||END|
.project||NONE||reviewed||END|
FridaNotes.txt||GHIDRA||||END|
Module.manifest||GHIDRA||||END|
build.gradle||GHIDRA||||END|

@ -1,8 +1,6 @@
##VERSION: 2.0
##MODULE IP: Apache License 2.0
##MODULE IP: Apache License 2.0 with LLVM Exceptions
.classpath||NONE||reviewed||END|
.project||NONE||reviewed||END|
Module.manifest||GHIDRA||||END|
build.gradle||GHIDRA||||END|
src/llvm-project/lldb/bindings/java/java-typemaps.swig||Apache License 2.0 with LLVM Exceptions||||END|

@ -1,8 +1,6 @@
##VERSION: 2.0
##MODULE IP: Apache License 2.0
##MODULE IP: Apache License 2.0 with LLVM Exceptions
.classpath||NONE||reviewed||END|
.project||NONE||reviewed||END|
InstructionsForBuildingLLDBInterface.txt||GHIDRA||||END|
Module.manifest||GHIDRA||||END|
build.gradle||GHIDRA||||END|

@ -0,0 +1,81 @@
Installation of the Elasticsearch BSim Plug-in:
In order to use Elasticsearch as the back-end database for a BSim instance,
the lsh plug-in, included with this Ghidra extension, must be installed on
the Elasticsearch cluster.
The lsh plug-in is bundled in the standard plug-in format as the file
'lsh.zip'. It must be installed separately on EVERY node of the cluster,
and each node must be restarted after the install in order for the plug-in to
become active.
For a single node, installation is accomplished with the command-line
'elasticsearch-plugin' script that comes with the standard Elasticsearch
distribution. It expects a URL pointing to the plug-in to be installed.
The basic command, executed in the Elasticsearch installation directory
for the node, is
bin/elasticsearch-plugin install file:///path/to/ghidra/Ghidra/Extensions/BSimElasticPlugin/data/lsh.zip
Replace the initial portion of the absolute path in the URL to point to your
particular Ghidra installation.
Deployment:
Follow the Elasticsearch documentation to do any additional configuration,
starting, stopping, and management of your Elasticsearch cluster.
To try BSim with a toy deployment, you can start a single node (as per the
documentation) from the command-line by just running
bin/elasticsearch
This will dump logging messages to the console, and you should see '[lsh]'
listed among the loaded plug-ins as the node starts up.
Once the Elasticsearch node(s) are running, whether they are a toy or a full
deployment, you can immediately proceed to using the 'bsim' command.
The Ghidra/BSim client and 'bsim' command automatically assume an
Elasticsearch server when they see the 'https' protocol in the provided URLs,
although the 'elastic' protocol may also be specified and is equivalent.
The use of the 'http' protocol for Elasticsearch is not supported.
Adjust the hostname, port number, and repository name as appropriate.
Use a command-line similar to the following to create a BSim instance:
bsim createdatabase elastic://1.2.3.4:9200/repo medium_32
This is equivalent to:
bsim createdatabase https://1.2.3.4:9200/repo medium_32
Use a command-line like this to generate and commit signatures from a Ghidra Server
repository to the Elasticsearch database created above:
bsim generatesigs ghidra://1.2.3.4/repo bsim=elastic://1.2.3.4:9200/repo
Within Ghidra's BSim client, enter the same URL into the database connection
panel in order to place queries to your Elasticsearch deployment. See the BSim
documentation included with Ghidra for full details.
Version:
The current BSim plug-in was designed and tested with Elasticsearch version 7.17.4.
A change to the Elasticsearch scripting interface, starting with version 7.15, makes the BSim
plug-in incompatible with previous versions, but the lsh plug-in jars may work without change
across later Elasticsearch versions.
Elasticsearch plug-ins explicitly encode the version of Elasticsearch they work with, and the
plug-in script will refuse to install the lsh plug-in if its version does not match your
particular installation. If your Elasticsearch version is slightly different, you can try
unpacking the zip file, changing the version number to match your software, and then repacking
the zip file. Within the zip archive, the version number is stored in a configuration file
elasticsearch/plugin-descriptor.properties
The file format is fairly simple: edit the line
elasticsearch.version=7.17.4
The plugin may work with other nearby versions, but proceed at your own risk.
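
Reviewer sketch (not part of the commit): the version edit described above amounts to rewriting a single line of plugin-descriptor.properties after unpacking lsh.zip. The class name DescriptorVersionEditor and the target version 7.17.5 are hypothetical, and unpacking and repacking the zip archive are left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: rewrite the elasticsearch.version entry of a plug-in descriptor. */
final class DescriptorVersionEditor {
    // Matches the whole "elasticsearch.version=..." line, wherever it sits in the file.
    private static final Pattern VERSION_LINE =
        Pattern.compile("(?m)^elasticsearch\\.version=.*$");

    /** Returns the descriptor text with elasticsearch.version set to newVersion. */
    static String withElasticsearchVersion(String descriptorText, String newVersion) {
        Matcher matcher = VERSION_LINE.matcher(descriptorText);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no elasticsearch.version entry found");
        }
        // quoteReplacement guards against '$' or '\' in the replacement text.
        return matcher.replaceFirst(
            Matcher.quoteReplacement("elasticsearch.version=" + newVersion));
    }

    public static void main(String[] args) {
        // 7.17.5 is a made-up example of a "nearby" version.
        String descriptor = "name=lsh\nelasticsearch.version=7.17.4\n";
        System.out.print(withElasticsearchVersion(descriptor, "7.17.5"));
    }
}
```

Only the version line changes; every other descriptor entry is passed through untouched, which matches the "proceed at your own risk" nature of the procedure.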

@ -0,0 +1,99 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
apply from: "$rootProject.projectDir/gradle/distributableGhidraExtension.gradle"
apply from: "$rootProject.projectDir/gradle/javaProject.gradle"
apply plugin: 'eclipse'
eclipse.project.name = 'Xtra BSimElasticPlugin'
// This module is very different from other Ghidra modules. It is creating a stand-alone jar
// file for an elastic database plugin. It is copying files from other modules into this module
// before building a jar file from the files in this module and the cherry-picked files from
// other modules (This is very brittle and will break if any of the files are renamed or moved.)
project.ext.includeExtensionInInstallation = true
apply plugin: 'java'
sourceSets {
elasticPlugin {
java {
srcDirs = [ 'src', 'srcdummy', 'build/genericSrc', 'build/utilitySrc', 'build/bsimSrc' ]
}
}
}
// this dependency block is needed for this code to compile in our eclipse environment. It is not needed
// for the gradle build
dependencies {
implementation project(':BSim')
}
libsDirName='ziplayout'
task copyGenericTask(type: Copy) {
from project(':Generic').file('src/main/java')
into 'build/genericSrc'
include 'generic/lsh/vector/*.java'
include 'generic/hash/SimpleCRC32.java'
include 'ghidra/util/xml/SpecXmlUtils.java'
}
task copyUtilityTask(type: Copy) {
from project(':Utility').file('src/main/java')
into 'build/utilitySrc'
include 'ghidra/xml/XmlPullParser.java'
include 'ghidra/xml/XmlElement.java'
}
task copyBSimTask(type: Copy) {
from project(':BSim').file('src/main/java')
into 'build/bsimSrc'
include 'ghidra/features/bsim/query/elastic/ElasticUtilities.java'
include 'ghidra/features/bsim/query/elastic/Base64Lite.java'
include 'ghidra/features/bsim/query/elastic/Base64VectorFactory.java'
}
task copyPropertiesFile(type: Copy) {
from 'contribZipExclude/plugin-descriptor.properties'
into 'build/ziplayout'
}
task elasticPluginJar(type: Jar) {
from sourceSets.elasticPlugin.output
archiveBaseName = 'lsh'
excludes = [
'**/org/apache',
'**/org/elasticsearch/common',
'**/org/elasticsearch/env',
'**/org/elasticsearch/index',
'**/org/elasticsearch/indices',
'**/org/elasticsearch/plugins',
'**/org/elasticsearch/script',
'**/org/elasticsearch/search'
]
}
task elasticPluginZip(type: Zip) {
from 'build/ziplayout'
archiveBaseName = 'lsh'
destinationDirectory = file("build/data")
}
compileElasticPluginJava.dependsOn copyGenericTask
compileElasticPluginJava.dependsOn copyUtilityTask
compileElasticPluginJava.dependsOn copyBSimTask
elasticPluginZip.dependsOn elasticPluginJar
elasticPluginZip.dependsOn copyPropertiesFile
jar.dependsOn elasticPluginZip

@ -0,0 +1,6 @@
##VERSION: 2.0
##MODULE IP: Apache License 2.0
INSTALL.txt||GHIDRA||||END|
Module.manifest||GHIDRA||reviewed||END|
contribZipExclude/plugin-descriptor.properties||GHIDRA||||END|
extension.properties||GHIDRA||||END|

@ -0,0 +1,6 @@
description=Feature Vector Plugin
version=1.0
name=lsh
classname=org.elasticsearch.plugin.analysis.lsh.AnalysisLSHPlugin
java.version=1.11
elasticsearch.version=8.8.1

@ -0,0 +1,5 @@
name=BSimElasticPlugin
description=Elasticsearch backend for BSim.
author=Ghidra Team
createdOn=11/23/20
version=@extversion@

@ -0,0 +1,134 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugin.analysis.lsh;
import java.io.IOException;
import java.util.*;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexModule;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.*;
import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptEngine;
import generic.lsh.vector.IDFLookup;
import generic.lsh.vector.WeightFactory;
import ghidra.features.bsim.query.elastic.Base64VectorFactory;
import ghidra.features.bsim.query.elastic.ElasticUtilities;
public class AnalysisLSHPlugin extends Plugin implements AnalysisPlugin, ScriptPlugin {
public static final String TOKENIZER_SETTINGS_BASE = "index.analysis.tokenizer.lsh_";
public static String settingString = "";
static private Map<String, Base64VectorFactory> vecFactoryMap = new HashMap<>();
private Map<String, AnalysisProvider<TokenizerFactory>> tokFactoryMap;
public class TokenizerFactoryProvider implements AnalysisProvider<TokenizerFactory> {
@Override
public TokenizerFactory get(IndexSettings indexSettings, Environment env, String name,
Settings settings) throws IOException {
// settingString = settingString + " : " + indexSettings.getIndex().getName() + '(' + name + ')';
return new LSHTokenizerFactory(indexSettings, env, name, settings);
}
}
public AnalysisLSHPlugin() {
TokenizerFactoryProvider provider = new TokenizerFactoryProvider();
tokFactoryMap = Collections.singletonMap("lsh_tokenizer", provider);
}
private static void setupVectorFactory(String name, String idfConfig, String lshWeights) {
WeightFactory weightFactory = new WeightFactory();
String[] split = lshWeights.split(" ");
double[] weightArray = new double[split.length];
for (int i = 0; i < weightArray.length; ++i) {
weightArray[i] = Double.parseDouble(split[i]);
}
weightFactory.set(weightArray);
IDFLookup idfLookup = new IDFLookup();
split = idfConfig.split(" ");
int[] intArray = new int[split.length];
for (int i = 0; i < intArray.length; ++i) {
intArray[i] = Integer.parseInt(split[i]);
}
idfLookup.set(intArray);
Base64VectorFactory vectorFactory = new Base64VectorFactory();
// Server-side factory is never used to generate signatures,
// so we don't need to specify settings
vectorFactory.set(weightFactory, idfLookup, 0);
vecFactoryMap.put(name, vectorFactory);
}
/**
* Entry point for Tokenizer and Script factories to grab the global vector factory
* @param name is the name of the tokenizer
* @return the vector factory used by the tokenizer
*/
public static Base64VectorFactory getVectorFactory(String name) {
return vecFactoryMap.get(name);
}
@Override
public void onIndexModule(IndexModule indexModule) {
super.onIndexModule(indexModule);
Settings settings = indexModule.getSettings();
String name = null;
// Look for the specific kind of tokenizer settings, within the global settings for the index
for (String key : settings.keySet()) {
if (key.startsWith(TOKENIZER_SETTINGS_BASE)) {
// We can have different settings for different indices, distinguished by this name
int pos = key.indexOf('.', TOKENIZER_SETTINGS_BASE.length() + 1);
if (pos > 0) {
name = key.substring(TOKENIZER_SETTINGS_BASE.length(), pos);
break;
}
}
}
if (name != null) {
String tokenizerName = "lsh_" + name;
if (getVectorFactory(tokenizerName) != null) {
return; // Factory already exists
}
settingString = settingString + " : onModule(" + name + ')';
// If we found LSH tokenizer settings, pull them out and construct an LSHVectorFactory with them
String baseKey = TOKENIZER_SETTINGS_BASE + name + '.';
String idfConfig = settings.get(baseKey + ElasticUtilities.IDF_CONFIG);
String lshWeights = settings.get(baseKey + ElasticUtilities.LSH_WEIGHTS);
if (idfConfig == null || lshWeights == null) {
return; // IDF_CONFIG and LSH_WEIGHTS settings must be present to proceed
}
setupVectorFactory(tokenizerName, idfConfig, lshWeights);
}
}
@Override
public ScriptEngine getScriptEngine(Settings settings, Collection<ScriptContext<?>> contexts) {
return new BSimScriptEngine();
}
@Override
public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
return tokFactoryMap;
}
}
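
Reviewer sketch (not part of the commit): onIndexModule recovers the per-index tokenizer name from the first settings key that starts with the "index.analysis.tokenizer.lsh_" prefix. The stand-alone class below isolates that string handling so it can be tried without Elasticsearch on the classpath. The class name TokenizerKeyParser and the ".type" key suffix are hypothetical; the real setting names come from ElasticUtilities, which is not shown in this diff.

```java
/** Sketch: recover the per-index tokenizer name from a settings key. */
final class TokenizerKeyParser {
    static final String TOKENIZER_SETTINGS_BASE = "index.analysis.tokenizer.lsh_";

    /**
     * Returns the name between the prefix and the next '.',
     * or null if the key is not an lsh tokenizer setting.
     */
    static String extractName(String key) {
        if (!key.startsWith(TOKENIZER_SETTINGS_BASE)) {
            return null;
        }
        int start = TOKENIZER_SETTINGS_BASE.length();
        // Search from start + 1 so the name is at least one character long.
        int pos = key.indexOf('.', start + 1);
        if (pos < 0) {
            return null;
        }
        return key.substring(start, pos);
    }

    public static void main(String[] args) {
        // ".type" is an illustrative suffix, not a documented BSim setting.
        System.out.println(extractName("index.analysis.tokenizer.lsh_repo.type"));
    }
}
```

The plug-in then prepends "lsh_" to this name to form the key under which the vector factory is stored.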

@ -0,0 +1,54 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugin.analysis.lsh;
import java.util.*;
import org.elasticsearch.script.*;
public class BSimScriptEngine implements ScriptEngine {
private final static String ENGINE_NAME = "bsim_scripts";
@Override
public <FactoryType> FactoryType compile(String scriptName, String scriptSource,
ScriptContext<FactoryType> context, Map<String, String> params) {
if (context.equals(ScoreScript.CONTEXT) == false) {
throw new IllegalArgumentException(
getType() + " scripts cannot be used for context [" + context.name + "]");
}
if (VectorCompareScriptFactory.SCRIPT_NAME.equals(scriptSource)) {
ScoreScript.Factory factory = new VectorCompareScriptFactory();
return context.factoryClazz.cast(factory);
}
throw new IllegalArgumentException("Unknown script name " + scriptSource);
}
@Override
public void close() {
// Can free up resources
}
@Override
public Set<ScriptContext<?>> getSupportedContexts() {
return Collections.singleton(ScoreScript.CONTEXT);
}
@Override
public String getType() {
return ENGINE_NAME;
}
}

@ -0,0 +1,293 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugin.analysis.lsh;
import generic.lsh.vector.HashEntry;
import ghidra.features.bsim.query.elastic.Base64Lite;
/**
* Class for calculating the bin ids on LSHVectors as part of the LSH indexing process
*
*/
public class LSHBinner {
private static final char[] hashSignTable = new char[512];
private static int VEC_SIZE_UPPER = 5; // Size above which to use FFT to calculate dotproduct family
private static int LSH_HASHBASE = 0xd7e6a299;
private static int HASH_MULTIPLIER = 1103515245;
private static int HASH_ADDEND = 12345;
public static class BytesRef {
public char[] buffer;
public BytesRef(int size) { buffer = new char[size]; }
}
private int k; // Number of bits per bin id
private int L; // Number of binnings
private double doubleBuffer[]; // Scratch space for dot-product calculation
private BytesRef tokenList[]; // Final token list used by lucene
static {
/**
* This is a precalculated table for generating dot-products with the random family of vectors directly
* The first vector r_0 is expressed as a hashing function on the dimension index and the other vectors
* are derived from r_0 using an FFT. The table is formed by precalculating the FFT on basis vectors in this table
*/
int i, j;
int[] arr = new int[16];
int hibit0ptr;
int hibit1ptr;
for (i = 0; i < 16; ++i) { /* For each 4-bit position */
hibit0ptr = i * 16;
hibit1ptr = (i + 16) * 16;
for (j = 0; j < 16; ++j)
arr[j] = 0;
arr[i] = 1;
hashFft16(arr);
for (j = 0; j < 16; ++j) {
if (arr[j] > 0) {
hashSignTable[hibit0ptr + j] = '+';
hashSignTable[hibit1ptr + j] = '-';
} else {
hashSignTable[hibit0ptr + j] = '-';
hashSignTable[hibit1ptr + j] = '+';
}
}
}
}
/**
* Raw Fast Fourier Transform on 16 wide integer array
* @param arr is the 16-long array
*/
private static void hashFft16(int[] arr) {
int x,y;
x = arr[0]; y = arr[8]; arr[0] = x + y; arr[8] = x - y;
x = arr[1]; y = arr[9]; arr[1] = x + y; arr[9] = x - y;
x = arr[2]; y = arr[10]; arr[2] = x + y; arr[10] = x - y;
x = arr[3]; y = arr[11]; arr[3] = x + y; arr[11] = x - y;
x = arr[4]; y = arr[12]; arr[4] = x + y; arr[12] = x - y;
x = arr[5]; y = arr[13]; arr[5] = x + y; arr[13] = x - y;
x = arr[6]; y = arr[14]; arr[6] = x + y; arr[14] = x - y;
x = arr[7]; y = arr[15]; arr[7] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[4]; arr[0] = x + y; arr[4] = x - y;
x = arr[1]; y = arr[5]; arr[1] = x + y; arr[5] = x - y;
x = arr[2]; y = arr[6]; arr[2] = x + y; arr[6] = x - y;
x = arr[3]; y = arr[7]; arr[3] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[12]; arr[8] = x + y; arr[12] = x - y;
x = arr[9]; y = arr[13]; arr[9] = x + y; arr[13] = x - y;
x = arr[10]; y = arr[14]; arr[10] = x + y; arr[14] = x - y;
x = arr[11]; y = arr[15]; arr[11] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[2]; arr[0] = x + y; arr[2] = x - y;
x = arr[1]; y = arr[3]; arr[1] = x + y; arr[3] = x - y;
x = arr[4]; y = arr[6]; arr[4] = x + y; arr[6] = x - y;
x = arr[5]; y = arr[7]; arr[5] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[10]; arr[8] = x + y; arr[10] = x - y;
x = arr[9]; y = arr[11]; arr[9] = x + y; arr[11] = x - y;
x = arr[12]; y = arr[14]; arr[12] = x + y; arr[14] = x - y;
x = arr[13]; y = arr[15]; arr[13] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[1]; arr[0] = x + y; arr[1] = x - y;
x = arr[2]; y = arr[3]; arr[2] = x + y; arr[3] = x - y;
x = arr[4]; y = arr[5]; arr[4] = x + y; arr[5] = x - y;
x = arr[6]; y = arr[7]; arr[6] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[9]; arr[8] = x + y; arr[9] = x - y;
x = arr[10]; y = arr[11]; arr[10] = x + y; arr[11] = x - y;
x = arr[12]; y = arr[13]; arr[12] = x + y; arr[13] = x - y;
x = arr[14]; y = arr[15]; arr[14] = x + y; arr[15] = x - y;
}
/**
* Raw Fast Fourier Transform on 16 wide array of doubles
* @param arr is the 16-long array
*/
private static void hashFft16(double[] arr) {
double x,y;
x = arr[0]; y = arr[8]; arr[0] = x + y; arr[8] = x - y;
x = arr[1]; y = arr[9]; arr[1] = x + y; arr[9] = x - y;
x = arr[2]; y = arr[10]; arr[2] = x + y; arr[10] = x - y;
x = arr[3]; y = arr[11]; arr[3] = x + y; arr[11] = x - y;
x = arr[4]; y = arr[12]; arr[4] = x + y; arr[12] = x - y;
x = arr[5]; y = arr[13]; arr[5] = x + y; arr[13] = x - y;
x = arr[6]; y = arr[14]; arr[6] = x + y; arr[14] = x - y;
x = arr[7]; y = arr[15]; arr[7] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[4]; arr[0] = x + y; arr[4] = x - y;
x = arr[1]; y = arr[5]; arr[1] = x + y; arr[5] = x - y;
x = arr[2]; y = arr[6]; arr[2] = x + y; arr[6] = x - y;
x = arr[3]; y = arr[7]; arr[3] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[12]; arr[8] = x + y; arr[12] = x - y;
x = arr[9]; y = arr[13]; arr[9] = x + y; arr[13] = x - y;
x = arr[10]; y = arr[14]; arr[10] = x + y; arr[14] = x - y;
x = arr[11]; y = arr[15]; arr[11] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[2]; arr[0] = x + y; arr[2] = x - y;
x = arr[1]; y = arr[3]; arr[1] = x + y; arr[3] = x - y;
x = arr[4]; y = arr[6]; arr[4] = x + y; arr[6] = x - y;
x = arr[5]; y = arr[7]; arr[5] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[10]; arr[8] = x + y; arr[10] = x - y;
x = arr[9]; y = arr[11]; arr[9] = x + y; arr[11] = x - y;
x = arr[12]; y = arr[14]; arr[12] = x + y; arr[14] = x - y;
x = arr[13]; y = arr[15]; arr[13] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[1]; arr[0] = x + y; arr[1] = x - y;
x = arr[2]; y = arr[3]; arr[2] = x + y; arr[3] = x - y;
x = arr[4]; y = arr[5]; arr[4] = x + y; arr[5] = x - y;
x = arr[6]; y = arr[7]; arr[6] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[9]; arr[8] = x + y; arr[9] = x - y;
x = arr[10]; y = arr[11]; arr[10] = x + y; arr[11] = x - y;
x = arr[12]; y = arr[13]; arr[12] = x + y; arr[13] = x - y;
x = arr[14]; y = arr[15]; arr[14] = x + y; arr[15] = x - y;
}
public LSHBinner() {
doubleBuffer = new double[16];
k = -1;
L = -1;
tokenList = null;
}
public void setKandL(int k,int L) {
this.k = k;
this.L = L;
int numBits = 1;
while( (1 << numBits) <= L )
numBits += 1;
numBits += k;
int numChar = numBits / 6;
if ((numBits % 6)!= 0)
numChar += 1;
tokenList = new BytesRef[L];
for(int i=0;i<L;++i) {
tokenList[i] = new BytesRef(numChar);
}
}
public BytesRef[] getTokenList() {
return tokenList;
}
/**
* Generate a dot product of the hash vector in -vec- with a random family of 16 vectors, { r }
* r_0 is a randomly generated set of +1 -1 coefficients across all the dimensions (indexed by uint32 vec[i].hash)
* The coefficient is calculated as a hashing function from the seed -hashcur- and the index (vec[i].hash),
* so it should be balanced between +1 and -1.
* All the other vectors are generated from an FFT of r_0. This allows the dotproduct with vec to be calculated
* using an FFT if -vec- has many non-zero coefficients. If -vec- has only a few non-zero coefficients,
* the dotproduct is calculated with each vector in the family directly for better efficiency.
* The resulting dotproducts are converted into a 16-long bitvector based on the sign of the dotproduct and
* placed in -bucket-
* @param bucket is the (possibly partially filled) accumulator for dotproduct bits
* @param vec is the HashEntry vector to calculate the dot-products on
* @param hashcur is the index of the hash subfamily representing r_0
* @return the bucket with new accumulated dot-product bits
*/
private int hash16DotProduct(int bucket,HashEntry[] vec,int hashcur)
{
int i, j;
int rowNum;
int signPtr;
for (i = 0; i < 16; ++i)
doubleBuffer[i] = 0.0; // Initialize the dotproduct results to zero
if (vec.length < VEC_SIZE_UPPER) { // If there are a small number of non-zero coefficients in -vec-
for (i = 0; i < vec.length; ++i) {
rowNum = vec[i].getHash() ^ hashcur; // Calculate the rest of the r_0 hashing function
rowNum = (rowNum * HASH_MULTIPLIER) + HASH_ADDEND;
rowNum = (rowNum >>> 24) & 0x1f;
signPtr = rowNum * 16;
for (j = 0; j < 16; ++j) { // Based on the precalculated coeff table calculate this portion of dotproduct
if (hashSignTable[signPtr + j] == '+')
doubleBuffer[j] += vec[i].getCoeff(); // Dot product with +1 // coeff
else
doubleBuffer[j] -= vec[i].getCoeff(); // Dot product with -1 // coeff
}
}
}
else { // If we have many non-zero coefficients in -vec-
for (i = 0; i < vec.length; ++i) {
rowNum = vec[i].getHash() ^ hashcur; // Calculate the rest of the r_0 hashing function
rowNum = (rowNum * HASH_MULTIPLIER) + HASH_ADDEND;
rowNum = (rowNum >>> 24) & 0x1f;
if (rowNum < 0x10) // Set-up for the FFT
doubleBuffer[rowNum] += vec[i].getCoeff();
else
doubleBuffer[rowNum & 0xf] -= vec[i].getCoeff();
}
hashFft16(doubleBuffer); // Calculate the remaining dot-products by performing the FFT
}
for (i = 0; i < 16; ++i) { // Convert the dot-product results to a bit-vector
bucket <<= 1;
if (doubleBuffer[i] > 0.0)
bucket |= 1;
}
return bucket;
}
public void generateBinIds(HashEntry[] vec)
{
int bucket = 0;
int bucketcnt = 0;
int i,bitsleft;
int curid;
int mask,val;
int hashbase = LSH_HASHBASE;
for (i = 0; i < L; ++i) {
curid = i; // Tack-on bits that indicate the particular table this bin id belongs to
bitsleft = k;
do {
if (bucketcnt == 0) {
hashbase = (hashbase * HASH_MULTIPLIER) + HASH_ADDEND;
bucket = hash16DotProduct(bucket, vec, hashbase);
bucketcnt += 16;
}
if (bucketcnt >= bitsleft) {
curid <<= bitsleft;
mask = 1;
mask = (mask << bitsleft) - 1;
val = bucket >>> (bucketcnt - bitsleft);
curid |= (val & mask);
bucketcnt -= bitsleft;
bitsleft = 0;
} else {
curid <<= bucketcnt;
mask = 1;
mask = (mask << bucketcnt) - 1;
curid |= (bucket & mask);
bitsleft -= bucketcnt;
bucketcnt = 0;
}
} while (bitsleft > 0);
char[] token = tokenList[i].buffer;
for(int j=0;j<token.length;++j) {
token[j] = Base64Lite.encode[curid & 0x3f]; // encode 6 bits
curid >>= 6; // move to next 6 bits
}
}
}
}
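
Reviewer sketch (not part of the commit): the unrolled butterflies in hashFft16 pair elements at strides 8, 4, 2 and 1, which is the pattern usually called an unnormalized fast Walsh-Hadamard transform. The loop form below reproduces the same sequence of additions and subtractions. The class name Hadamard16 and the helper basis are hypothetical.

```java
/** Sketch: loop form of the 16-point butterfly used by hashFft16. */
final class Hadamard16 {
    /** In-place transform on a 16-element array, using strides 8, 4, 2, 1. */
    static void transform(double[] arr) {
        if (arr.length != 16) {
            throw new IllegalArgumentException("expected 16 elements");
        }
        for (int stride = 8; stride >= 1; stride >>= 1) {
            // Each block of 2 * stride elements is split into two halves
            // that are combined pairwise as (x + y, x - y).
            for (int block = 0; block < 16; block += 2 * stride) {
                for (int i = block; i < block + stride; ++i) {
                    double x = arr[i];
                    double y = arr[i + stride];
                    arr[i] = x + y;
                    arr[i + stride] = x - y;
                }
            }
        }
    }

    /** Returns the basis vector with a single 1.0 at the given index. */
    static double[] basis(int index) {
        double[] arr = new double[16];
        arr[index] = 1.0;
        return arr;
    }

    public static void main(String[] args) {
        double[] e1 = basis(1);
        transform(e1);
        System.out.println(java.util.Arrays.toString(e1));
    }
}
```

Transforming a basis vector yields only +1 and -1 entries, which is why the static initializer can store the precalculated table as '+' and '-' characters.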

@ -0,0 +1,68 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugin.analysis.lsh;
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.elasticsearch.plugin.analysis.lsh.LSHBinner.BytesRef;
import generic.lsh.vector.LSHVector;
import ghidra.features.bsim.query.elastic.Base64VectorFactory;
public class LSHTokenizer extends Tokenizer {
private final CharTermAttribute bytesAtt = addAttribute(CharTermAttribute.class);
private BytesRef[] tokens;
private int pos; // Number of terms/tokens returned so far
private Base64VectorFactory vectorFactory;
private LSHBinner binner;
private char[] vecBuffer;
public LSHTokenizer(int k,int L,Base64VectorFactory vFactory) {
super(DEFAULT_TOKEN_ATTRIBUTE_FACTORY);
vectorFactory = vFactory;
binner = new LSHBinner();
binner.setKandL(k, L);
pos = -1;
vecBuffer = Base64VectorFactory.allocateBuffer();
}
@Override
public boolean incrementToken() throws IOException {
clearAttributes();
if (pos < 0) {
LSHVector vector = vectorFactory.restoreVectorFromBase64(input,vecBuffer);
// AnalysisLSHPlugin.settingString = AnalysisLSHPlugin.settingString + " : " + Long.toHexString(vector.calcUniqueHash());
binner.generateBinIds(vector.getEntries());
tokens = binner.getTokenList();
pos = 0;
}
if (pos < tokens.length) {
char[] buffer = tokens[pos].buffer;
bytesAtt.copyBuffer(buffer,0,buffer.length);
pos += 1;
return true;
}
return false;
}
@Override
public void reset() throws IOException {
super.reset();
pos = -1;
}
}
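
Reviewer sketch (not part of the commit): each token returned by incrementToken is a fixed-width character buffer filled in by LSHBinner. The width follows from k and L as computed in setKandL, and the bin id is written six bits at a time, lowest bits first, as in generateBinIds. The class name BinIdTokens, the values k=17 and L=10, and the alphabet are assumptions; the real table is Base64Lite.encode, which this diff does not show.

```java
/** Sketch: token width and 6-bit encoding of a bin id. */
final class BinIdTokens {
    // Assumed alphabet for illustration only; the plug-in uses Base64Lite.encode.
    private static final char[] ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();

    /** Number of characters needed for k hash bits plus the table-index bits for L tables. */
    static int tokenChars(int k, int L) {
        int numBits = 1;
        while ((1 << numBits) <= L) {
            numBits += 1; // bits needed to tag which of the L binnings an id belongs to
        }
        numBits += k;
        return (numBits + 5) / 6; // round up to whole 6-bit characters
    }

    /** Encodes binId into numChars characters, least significant 6 bits first. */
    static char[] encode(int binId, int numChars) {
        char[] token = new char[numChars];
        for (int j = 0; j < numChars; ++j) {
            token[j] = ALPHABET[binId & 0x3f];
            binId >>= 6;
        }
        return token;
    }

    public static void main(String[] args) {
        // k = 17 and L = 10 are made-up example parameters.
        int width = tokenChars(17, 10);
        System.out.println(width + " chars: " + new String(encode(66, width)));
    }
}
```

Because the table index is shifted into the high bits of the id before encoding, tokens from different binnings cannot collide even when their hash bits agree.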

@ -0,0 +1,44 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugin.analysis.lsh;
import org.apache.lucene.analysis.Tokenizer;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;
import ghidra.features.bsim.query.elastic.Base64VectorFactory;
import ghidra.features.bsim.query.elastic.ElasticUtilities;
public class LSHTokenizerFactory extends AbstractTokenizerFactory {
private Base64VectorFactory vectorFactory;
private int k;
private int L;
public LSHTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
super(indexSettings, settings, name);
k = settings.getAsInt(ElasticUtilities.K_SETTING, -1);
L = settings.getAsInt(ElasticUtilities.L_SETTING, -1);
vectorFactory = AnalysisLSHPlugin.getVectorFactory(name);
}
@Override
public Tokenizer create() {
return new LSHTokenizer(k,L,vectorFactory);
}
}

@ -0,0 +1,147 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugin.analysis.lsh;
import java.io.*;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.BytesRef;
import org.elasticsearch.script.*;
import org.elasticsearch.script.ScoreScript.LeafFactory;
import org.elasticsearch.search.lookup.SearchLookup;
import generic.lsh.vector.LSHVector;
import generic.lsh.vector.VectorCompare;
import ghidra.features.bsim.query.elastic.Base64VectorFactory;
public class VectorCompareScriptFactory implements ScoreScript.Factory {
public final static String SCRIPT_NAME = "lsh_compare";
public final static String FEATURES_NAME = "{\"features\":\"";
@Override
public boolean isResultDeterministic() {
return true;
}
@Override
public LeafFactory newFactory(Map<String, Object> params, SearchLookup lookup) {
return new VectorCompareLeafFactory(params, lookup);
}
private static class VectorCompareLeafFactory implements LeafFactory {
private final Map<String, Object> params;
private final SearchLookup lookup;
private LSHVector baseVector; // Vector being compared to everything
private final double simthresh; // Similarity threshold
private final double sigthresh; // Significance threshold
private final Base64VectorFactory vectorFactory; // Factory used for this particular query
private VectorCompareLeafFactory(Map<String, Object> params, SearchLookup lookup) {
this.params = params;
this.lookup = lookup;
vectorFactory = AnalysisLSHPlugin.getVectorFactory((String) params.get("indexname"));
simthresh = (Double) params.get("simthresh");
sigthresh = (Double) params.get("sigthresh");
StringReader reader = new StringReader((String) params.get("vector"));
try {
baseVector = vectorFactory.restoreVectorFromBase64(reader,
Base64VectorFactory.allocateBuffer());
}
catch (IOException e) {
baseVector = null;
}
}
@Override
public boolean needs_score() {
return false;
}
private static int scanForFeatures(byte[] buffer, int offset) throws IOException {
int i = 0;
while (i < FEATURES_NAME.length()) {
char curChar = FEATURES_NAME.charAt(i);
int val = buffer[offset];
if (val == curChar) {
i += 1;
offset += 1;
}
else if (val == ' ' || val == '\t') {
offset += 1;
}
else {
throw new IOException("Document is missing \"features\"");
}
}
return offset;
}
private static int scanForLength(BytesRef byteRef, int startOffset) throws IOException {
int finalLength = 0;
int maxLength = byteRef.length - (startOffset - byteRef.offset);
while (finalLength < maxLength) {
if (byteRef.bytes[finalLength + startOffset] == '\"') {
break;
}
finalLength += 1;
}
// The loop above stops at maxLength (not byteRef.length) when no closing quote is found
if (finalLength == maxLength) {
throw new IOException("Document does not contain complete \"features\"");
}
return finalLength;
}
@Override
public ScoreScript newInstance(DocReader docReader) throws IOException {
return new ScoreScript(params, lookup, docReader) {
@Override
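public double execute(ExplanationHolder explanation) {
if (baseVector == null) {
// The query vector failed to decode in the constructor, so nothing can match
return 0.0;
}
try {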
DocValuesDocReader dvReader = (DocValuesDocReader) docReader;
Document document =
dvReader.getLeafReaderContext().reader().document(_getDocId());
BytesRef byteRef = document.getField("_source").binaryValue();
int valOffset = scanForFeatures(byteRef.bytes, byteRef.offset);
int finalLength = scanForLength(byteRef, valOffset);
InputStream inputStream =
new ByteArrayInputStream(byteRef.bytes, valOffset, finalLength);
Reader reader = new InputStreamReader(inputStream);
// Ideally a single VectorCompare would be shared between calls, but this
// routine must be thread safe, so a new one is allocated per call
VectorCompare vectorCompare = new VectorCompare();
LSHVector curVec = vectorFactory.restoreVectorFromBase64(reader,
Base64VectorFactory.allocateBuffer());
double sim = baseVector.compare(curVec, vectorCompare);
if (sim <= simthresh) {
return 0.0;
}
double sig = vectorFactory.calculateSignificance(vectorCompare);
if (sig <= sigthresh) {
return 0.0;
}
return sim;
}
catch (IOException e) {
return 0.0;
}
}
};
}
}
}
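
The two scanning helpers above can be exercised outside Elasticsearch. The following is a minimal stand-alone sketch, not part of the commit: the class name `FeaturesScanSketch`, the explicit `end` bound, and the `extractFeatures` helper are assumptions made for illustration. It shows the same idea, matching the `{"features":"` prefix while tolerating spaces and tabs, then measuring the value up to its closing quote, with every array access bounds-checked.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone helper for illustration; not part of the commit.
final class FeaturesScanSketch {
    private static final String FEATURES_NAME = "{\"features\":\"";

    // Returns the index of the first byte after the {"features":" prefix,
    // skipping spaces and tabs between the expected characters.
    static int scanForFeatures(byte[] buffer, int offset, int end) throws IOException {
        int i = 0;
        while (i < FEATURES_NAME.length()) {
            if (offset >= end) {
                throw new IOException("Document is missing \"features\"");
            }
            int val = buffer[offset];
            if (val == FEATURES_NAME.charAt(i)) {
                i += 1;
                offset += 1;
            }
            else if (val == ' ' || val == '\t') {
                offset += 1;
            }
            else {
                throw new IOException("Document is missing \"features\"");
            }
        }
        return offset;
    }

    // Returns the number of bytes between start and the closing double quote.
    static int scanForLength(byte[] buffer, int start, int end) throws IOException {
        for (int pos = start; pos < end; pos++) {
            if (buffer[pos] == '"') {
                return pos - start;
            }
        }
        throw new IOException("Document does not contain complete \"features\"");
    }

    // Convenience wrapper: extracts the raw (still encoded) features string.
    static String extractFeatures(byte[] source) throws IOException {
        int start = scanForFeatures(source, 0, source.length);
        int length = scanForLength(source, start, source.length);
        return new String(source, start, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "{ \"features\":\"AbC123\",\"id\":\"x\"}"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(extractFeatures(doc)); // prints AbC123
    }
}
```

Passing an explicit `end` bound lets both helpers report a truncated document as an `IOException` rather than an index error, which matters because the scoring code only catches `IOException`.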

View File

@ -0,0 +1,29 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.analysis;
import java.io.Closeable;
import java.io.IOException;
import org.apache.lucene.util.AttributeFactory;
import org.apache.lucene.util.AttributeSource;
public abstract class TokenStream extends AttributeSource implements Closeable {
public static final AttributeFactory DEFAULT_TOKEN_ATTRIBUTE_FACTORY = null;
public abstract boolean incrementToken() throws IOException;
}

View File

@ -0,0 +1,38 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.analysis;
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.util.AttributeFactory;
public abstract class Tokenizer extends TokenStream {
protected Reader input;
protected Tokenizer(AttributeFactory factory) {
}
@Override
public void close() throws IOException {
}
public void reset() throws IOException {
}
}

View File

@ -0,0 +1,25 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.analysis.tokenattributes;
import org.apache.lucene.util.Attribute;
public interface CharTermAttribute extends Attribute, CharSequence, Appendable {
public void copyBuffer(char[] buffer, int offset, int length);
}

View File

@ -0,0 +1,26 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.document;
import org.apache.lucene.index.IndexableField;
public class Document {
public final IndexableField getField(String name) {
return null;
}
}

View File

@ -0,0 +1,27 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.index;
import java.io.Closeable;
import java.io.IOException;
import org.apache.lucene.document.Document;
public abstract class IndexReader implements Closeable {
public final Document document(int docID) throws IOException {
return null;
}
}

View File

@ -0,0 +1,21 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.index;
public abstract class IndexReaderContext {
public abstract IndexReader reader();
}

View File

@ -0,0 +1,23 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.index;
import org.apache.lucene.util.BytesRef;
public interface IndexableField {
public BytesRef binaryValue();
}

View File

@ -0,0 +1,21 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.index;
public abstract class LeafReader extends IndexReader {
}

View File

@ -0,0 +1,24 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.index;
public final class LeafReaderContext extends IndexReaderContext {
@Override
public LeafReader reader() {
return null;
}
}

View File

@ -0,0 +1,21 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.util;
public interface Attribute {
}

View File

@ -0,0 +1,20 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.util;
public abstract class AttributeFactory {
}

View File

@ -0,0 +1,27 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.util;
public class AttributeSource {
public final <T extends Attribute> T addAttribute(Class<T> attClass) {
return null;
}
public final void clearAttributes() {
}
}

View File

@ -0,0 +1,23 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for lucene class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.util;
public class BytesRef {
public byte[] bytes;
public int length;
public int offset;
}

View File

@ -0,0 +1,34 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.common.settings;
import java.util.Set;
public class Settings {
public Integer getAsInt(String setting, Integer defaultValue) {
return null;
}
public String get(String setting) {
return null;
}
public Set<String> keySet() {
return null;
}
}

View File

@ -0,0 +1,21 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.env;
public class Environment {
}

View File

@ -0,0 +1,26 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.index;
import org.elasticsearch.common.settings.Settings;
public class IndexModule {
public Settings getSettings() {
return null;
}
}

View File

@ -0,0 +1,21 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.index;
public final class IndexSettings {
}

View File

@ -0,0 +1,27 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.index.analysis;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.IndexSettings;
public abstract class AbstractTokenizerFactory implements TokenizerFactory {
public AbstractTokenizerFactory(IndexSettings indexSettings, Settings settings, String name) {
}
}

View File

@ -0,0 +1,24 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.index.analysis;
import org.apache.lucene.analysis.Tokenizer;
public interface TokenizerFactory {
Tokenizer create();
}

View File

@ -0,0 +1,31 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.indices.analysis;
import java.io.IOException;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
public class AnalysisModule {
public interface AnalysisProvider<T> {
T get(IndexSettings indexSettings, Environment environment, String name, Settings settings)
throws IOException;
}
}

View File

@ -0,0 +1,27 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugins;
import java.util.Map;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
public interface AnalysisPlugin {
Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers();
}

View File

@ -0,0 +1,32 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugins;
import java.io.Closeable;
import java.io.IOException;
import org.elasticsearch.index.IndexModule;
public abstract class Plugin implements Closeable {
public void onIndexModule(IndexModule indexModule) {
}
@Override
public void close() throws IOException {
}
}

View File

@ -0,0 +1,28 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.plugins;
import java.util.Collection;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptEngine;
public interface ScriptPlugin {
ScriptEngine getScriptEngine(Settings settings, Collection<ScriptContext<?>> contexts);
}

View File

@ -0,0 +1,21 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.script;
public interface DocReader {
}

View File

@ -0,0 +1,28 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.script;
import org.apache.lucene.index.LeafReaderContext;
public class DocValuesDocReader implements DocReader, LeafReaderContextSupplier {
@Override
public LeafReaderContext getLeafReaderContext() {
return null;
}
}

View File

@ -0,0 +1,23 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.script;
import org.apache.lucene.index.LeafReaderContext;
public interface LeafReaderContextSupplier {
LeafReaderContext getLeafReaderContext();
}

View File

@ -0,0 +1,50 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.script;
import java.io.IOException;
import java.util.Map;
import org.elasticsearch.search.lookup.SearchLookup;
public abstract class ScoreScript {
public ScoreScript(Map<String, Object> params, SearchLookup searchLookup, DocReader docReader) {
}
public static class ExplanationHolder {
}
public static final ScriptContext<ScoreScript.Factory> CONTEXT = null;
public interface Factory extends ScriptFactory {
LeafFactory newFactory(Map<String, Object> params, SearchLookup lookup);
}
public interface LeafFactory {
boolean needs_score();
ScoreScript newInstance(DocReader reader) throws IOException;
}
public int _getDocId() {
return 0;
}
public abstract double execute(ExplanationHolder explanation);
}
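
The placeholder above declares a three-level chain: a `Factory` creates a `LeafFactory` per query, which in turn creates a script instance that scores each document. The following is a minimal sketch of that pattern using hypothetical miniature interfaces, not the Elasticsearch API. It assumes the similarity and significance values are handed in directly, whereas the real script computes them from stored vectors. The thresholds are read from the parameter map once per query, and a score of 0.0 is returned unless both thresholds are exceeded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of the Factory -> LeafFactory -> script chain;
// these interfaces are illustrative and are not the Elasticsearch types.
final class ScoreChainSketch {
    interface Script {
        double execute(double sim, double sig);
    }

    interface LeafFactory {
        Script newInstance();
    }

    interface Factory {
        LeafFactory newFactory(Map<String, Object> params);
    }

    static final Factory THRESHOLD_FACTORY = params -> {
        // Parse the query parameters once per query, not once per document
        final double simthresh = (Double) params.get("simthresh");
        final double sigthresh = (Double) params.get("sigthresh");
        // Each script instance keeps the similarity as the score only when
        // both the similarity and the significance exceed their thresholds
        return () -> (sim, sig) -> (sim <= simthresh || sig <= sigthresh) ? 0.0 : sim;
    };

    public static void main(String[] args) {
        Map<String, Object> params = new HashMap<>();
        params.put("simthresh", 0.7);
        params.put("sigthresh", 10.0);
        Script script = THRESHOLD_FACTORY.newFactory(params).newInstance();
        System.out.println(script.execute(0.9, 25.0)); // prints 0.9
        System.out.println(script.execute(0.5, 25.0)); // prints 0.0
        System.out.println(script.execute(0.9, 3.0));  // prints 0.0
    }
}
```

Splitting the work this way keeps per-query setup (parameter parsing) out of the per-document path, which is the same reason the real leaf factory decodes its base vector in the constructor.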

View File

@ -0,0 +1,22 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.script;
public final class ScriptContext<T> {
public final String name = null;
public final Class<T> factoryClazz = null;
}

View File

@ -0,0 +1,30 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.script;
import java.io.Closeable;
import java.util.Map;
import java.util.Set;
public interface ScriptEngine extends Closeable {
String getType();
<FactoryType> FactoryType compile(String name, String code, ScriptContext<FactoryType> context,
Map<String, String> params);
Set<ScriptContext<?>> getSupportedContexts();
}

View File

@ -0,0 +1,22 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch interface
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.script;
public interface ScriptFactory {
boolean isResultDeterministic();
}

View File

@ -0,0 +1,21 @@
/* ###
* IP: GHIDRA
* NOTE: Dummy placeholder for elasticsearch class
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.elasticsearch.search.lookup;
public class SearchLookup {
}

View File

@ -0,0 +1,9 @@
##MODULE IP: Oxygen Icons - LGPL 3.0
MODULE FILE LICENSE: postgresql-15.3.tar.gz Postgresql License
MODULE FILE LICENSE: lib/postgresql-42.6.0.jar PostgresqlJDBC License
MODULE FILE LICENSE: lib/json-simple-1.1.1.jar Apache License 2.0
MODULE FILE LICENSE: lib/commons-dbcp2-2.9.0.jar Apache License 2.0
MODULE FILE LICENSE: lib/commons-pool2-2.11.1.jar Apache License 2.0
MODULE FILE LICENSE: lib/commons-logging-1.2.jar Apache License 2.0
MODULE FILE LICENSE: lib/log4j-jcl-2.16.0.jar Apache License 2.0
MODULE FILE LICENSE: lib/h2-2.2.220.jar H2 Mozilla License 2.0

197
Ghidra/Features/BSim/build.gradle Executable file
View File

@ -0,0 +1,197 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
apply from: "$rootProject.projectDir/gradle/distributableGhidraModule.gradle"
apply from: "$rootProject.projectDir/gradle/javaProject.gradle"
apply from: "$rootProject.projectDir/gradle/javaTestProject.gradle"
apply from: "$rootProject.projectDir/gradle/nativeProject.gradle"
apply from: "$rootProject.projectDir/gradle/helpProject.gradle"
apply plugin: 'eclipse'
eclipse.project.name = 'Features BSim'
import java.nio.file.Files
import org.gradle.util.GUtil
// NOTE: fetchDependencies.gradle must be updated if postgresql version changes
def postgresql_distro = "postgresql-15.3.tar.gz"
dependencies {
api project(":Decompiler")
api project(":CodeCompare")
api "org.postgresql:postgresql:42.6.0"
api "org.json.simple:json-simple:1.1.1"
api "org.apache.commons:commons-dbcp2:2.9.0"
api "org.apache.commons:commons-pool2:2.11.1"
api "org.apache.commons:commons-logging:1.2"
api "org.apache.logging.log4j:log4j-jcl:2.16.0"
api "com.h2database:h2:2.2.220"
}
// Copy postgresql source distro, lshvector plugin source, and make-postgres.sh
// into common zip to allow for a rebuild of the postgres server if needed
rootProject.assembleDistribution {
String postgresqlDepsFile = "${DEPS_DIR}/BSim/${postgresql_distro}"
String postgresqlBinRepoFile = "${BIN_REPO}/Ghidra/Features/BSim/${postgresql_distro}"
def postgresqlFile = file(postgresqlDepsFile).exists() ? postgresqlDepsFile : postgresqlBinRepoFile
into (getZipPath(this.project)) {
from file("make-postgres.sh")
}
into (getZipPath(this.project)) {
from file(postgresqlFile)
}
into (getZipPath(this.project) + "/src/lshvector") {
from files("src/lshvector")
}
}
// Relative to the 'workingDir' Exec task property.
def installPoint = "../help/help"
/**
* Build the pdf docs for BSim and place into the '$installPoint' directory.
* A build (ex: 'gradle buildLocalTSSI_Release') will place the pdf in the distribution.
* There is an associated, auto-generated clean task.
**/
task buildBSimHelpPdf(type: Exec) {
workingDir 'src/main/doc'
def buildDir = "../../../build/BSimDocumentationPdf"
// Gradle will provide a cleanBuildBSimHelpPdf task that will remove these
// declared outputs.
outputs.dir "$workingDir/$buildDir"
outputs.file "$workingDir/$buildDir/bsim.pdf"
// 'which' returns the number of failed arguments
// Using the 'which' command first will allow the task to fail if the required
// executables are not installed.
//
// The bash commands end with "2>&1" to redirect stderr to stdout and have all
// messages print in sequence
//
// 'commandLine' takes one command, so wrap multiple commands in bash.
commandLine 'bash', '-e', '-c', """
echo '** Checking if required executables are installed. **'
which xsltproc
which fop
echo '** Preparing for xsltproc **'
mkdir -p $buildDir/images
cp $installPoint/topics/BSimDatabasePlugin/images/*.png $buildDir/images
echo '** Building bsim.fo **'
xsltproc --output $buildDir/bsim_withscaling.xml --stringparam profile.condition "withscaling" commonprofile.xsl bsim.xml 2>&1
xsltproc --output $buildDir/bsim.fo focustom.xsl $buildDir/bsim_withscaling.xml 2>&1
echo '** Building bsim.pdf **'
fop $buildDir/bsim.fo $buildDir/bsim.pdf 2>&1
echo '** Done. **'
"""
// Allows doLast block regardless of exit value.
ignoreExitValue true
// Store the output instead of printing to the console.
standardOutput = new ByteArrayOutputStream()
ext.output = { standardOutput.toString() }
ext.errorOutput = { standardOutput.toString() }
// Check the OS before executing command.
doFirst {
if (!getCurrentPlatformName().startsWith("linux")) {
throw new TaskExecutionException( it, new Exception("The '$it.name' task only works on Linux."))
}
}
// Print the output of the commands and check the return value.
doLast {
println output()
if (execResult.exitValue) {
logger.error("$it.name: An error occurred. Here is the output:\n" + output())
throw new TaskExecutionException( it, new Exception("'$it.name': The command '${commandLine.join(' ')}'" +
" \nfailed with exit code $execResult.exitValue; see task output for details."))
}
}
}
/**
* Build the html docs for BSim and place into the '$installPoint' directory.
* A build (ex: 'gradle buildLocalTSSI_Release') will place the html files in the distribution.
**/
task buildBSimHelpHtml(type: Exec) {
workingDir 'src/main/doc'
def buildDir = "../../../build/html"
// 'which' returns the number of failed arguments
// Using the 'which' command first will allow the task to fail if the required
// executables are not installed.
//
// The bash commands end with "2>&1" to redirect stderr to stdout and have all
// messages print in sequence
//
// 'commandLine' takes one command, so wrap multiple commands in bash.
commandLine 'bash', '-e', '-c', """
echo '** Checking if required executables are installed. **'
which xsltproc
which sed
echo '** Removing older html files installed under '$installPoint' **'
rm -f $installPoint/topics/BSimDatabasePlugin/*.html
echo '** Building html files **'
xsltproc --output $buildDir/bsim_noscaling.xml --stringparam profile.condition "noscaling" commonprofile.xsl bsim.xml 2>&1
xsltproc --stringparam base.dir ${installPoint}/topics/BSimDatabasePlugin/ htmlcustom.xsl $buildDir/bsim_noscaling.xml 2>&1
sed -i -e '/DefaultStyle.css/ { p; sQhref=".*"Qhref="../../shared/languages.css"Q; }' ${installPoint}/topics/BSimDatabasePlugin/*.html
rm $installPoint/topics/BSimDatabasePlugin/index.html
echo '** Done. **'
"""
// Allows doLast block regardless of exit value.
ignoreExitValue true
// Store the output instead of printing to the console.
standardOutput = new ByteArrayOutputStream()
ext.output = { standardOutput.toString() }
ext.errorOutput = { standardOutput.toString() }
// Check the OS before executing command.
doFirst {
if (!getCurrentPlatformName().startsWith("linux")) {
throw new TaskExecutionException( it, new Exception("The '$it.name' task only works on Linux."))
}
}
// Print the output of the commands and check the return value.
doLast {
println output()
if (execResult.exitValue) {
logger.error("$it.name: An error occurred. Here is the output:\n" + output())
throw new TaskExecutionException( it, new Exception("'$it.name': The command '${commandLine.join(' ')}'" +
" \nfailed with exit code $execResult.exitValue; see task output for details."))
}
}
}
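
The two Exec tasks above share one pattern: merge stderr into stdout, capture the combined output, and fail only after that output has been reported. A minimal Java sketch of the same pattern follows. It assumes nothing about Gradle internals, and `ExecSketch`, `Result`, `run`, and `requireSuccess` are hypothetical names that are not part of this commit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper (not part of Ghidra or Gradle): run a command,
// capture stdout and stderr together, and report the exit value.
final class ExecSketch {

    static final class Result {
        final int exitValue;
        final String output;

        Result(int exitValue, String output) {
            this.exitValue = exitValue;
            this.output = output;
        }
    }

    // Equivalent in spirit to 'ignoreExitValue true' plus '2>&1':
    // never throws on a non-zero exit, and interleaves both streams.
    static Result run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // stderr is merged into stdout
                .start();
            String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            int exitValue = process.waitFor();
            return new Result(exitValue, output);
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for command", e);
        }
    }

    // Equivalent in spirit to the doLast block: fail with the captured
    // output attached when the exit value is non-zero.
    static String requireSuccess(List<String> command) {
        Result result = run(command);
        if (result.exitValue != 0) {
            throw new IllegalStateException("Command " + command +
                " failed with exit code " + result.exitValue + ":\n" + result.output);
        }
        return result.output;
    }
}
```

Separating `run` from `requireSuccess` mirrors the split between the task body and its `doLast` check: the output is always available, and the decision to fail is made afterwards.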

@ -0,0 +1,51 @@
##VERSION: 2.0
##MODULE IP: Apache License 2.0
##MODULE IP: Creative Commons Attribution 2.5
##MODULE IP: Crystal Clear Icons - LGPL 2.1
##MODULE IP: FAMFAMFAM Icons - CC 2.5
##MODULE IP: H2 Mozilla License 2.0
##MODULE IP: LGPL 2.1
##MODULE IP: LGPL 3.0
##MODULE IP: Oxygen Icons - LGPL 3.0
##MODULE IP: Postgresql License
##MODULE IP: PostgresqlJDBC License
##MODULE IP: Public Domain
Module.manifest||GHIDRA||||END|
data/bsim.theme.properties||GHIDRA||||END|
data/large_32.xml||GHIDRA||||END|
data/lshweights_32.xml||GHIDRA|||Signature data|END|
data/lshweights_64.xml||GHIDRA|||Signature data|END|
data/lshweights_64_32.xml||GHIDRA|||Signature data|END|
data/lshweights_cpool.xml||GHIDRA||||END|
data/lshweights_nosize.xml||GHIDRA||||END|
data/medium_32.xml||GHIDRA||||END|
data/medium_64.xml||GHIDRA||||END|
data/medium_cpool.xml||GHIDRA||||END|
data/medium_nosize.xml||GHIDRA||||END|
data/serverconfig.xml||GHIDRA||||END|
src/lshvector/Makefile.lshvector||GHIDRA||||END|
src/lshvector/lshvector--1.0.sql||GHIDRA||||END|
src/lshvector/lshvector.control||GHIDRA||||END|
src/main/help/help/TOC_Source.xml||GHIDRA||||END|
src/main/help/help/topics/BSim/BSimOverview.html||GHIDRA||||END|
src/main/help/help/topics/BSim/CommandLineReference.html||GHIDRA||||END|
src/main/help/help/topics/BSim/DatabaseConfiguration.html||GHIDRA||||END|
src/main/help/help/topics/BSim/FeatureWeight.html||GHIDRA||||END|
src/main/help/help/topics/BSim/IngestProcess.html||GHIDRA||||END|
src/main/help/help/topics/BSimSearchPlugin/BSimSearch.html||GHIDRA||||END|
src/main/help/help/topics/BSimSearchPlugin/images/AddServerDialog.png||GHIDRA||||END|
src/main/help/help/topics/BSimSearchPlugin/images/ApplyResultsPanel.png||GHIDRA||||END|
src/main/help/help/topics/BSimSearchPlugin/images/BSimOverviewDialog.png||GHIDRA||||END|
src/main/help/help/topics/BSimSearchPlugin/images/BSimOverviewResults.png||GHIDRA||||END|
src/main/help/help/topics/BSimSearchPlugin/images/BSimResultsProvider.png||GHIDRA||||END|
src/main/help/help/topics/BSimSearchPlugin/images/BSimSearchDialog.png||GHIDRA||||END|
src/main/help/help/topics/BSimSearchPlugin/images/ManageServersDialog.png||GHIDRA||||END|
src/main/resources/bsim.log4j.xml||GHIDRA||||END|
src/main/resources/images/checkmark_yellow.gif||GHIDRA||||END|
src/main/resources/images/flag_green.png||FAMFAMFAM Icons - CC 2.5|||famfamfam silk icon set|END|
src/main/resources/images/preferences-desktop-user-password.png||Oxygen Icons - LGPL 3.0|||Oxygen icon theme (dual license; LGPL or CC-SA-3.0)|END|
src/main/resources/images/preferences-web-browser-shortcuts-32.png||Oxygen Icons - LGPL 3.0|||Oxygen icon theme (dual license; LGPL or CC-SA-3.0)|END|
src/main/resources/images/preferences-web-browser-shortcuts.png||LGPL 3.0|||oxygen|END|
src/main/resources/images/view_top_bottom.png||Crystal Clear Icons - LGPL 2.1||||END|
src/main/resources/log4j-appender-console.xml||GHIDRA||||END|
src/main/resources/log4j-appender-rolling-file.xml||GHIDRA||||END|

@ -0,0 +1,17 @@
[Defaults]
icon.bsim.query.dialog.provider = preferences-web-browser-shortcuts.png
icon.bsim.change.password = preferences-desktop-user-password.png
icon.bsim.table.split = view_top_bottom.png
icon.bsim.results.status.name.applied = checkmark_green.gif
icon.bsim.results.status.signature.applied = EMPTY_ICON {checkmark_green.gif[move(-2,-1)]} {checkmark_green.gif [move(4,0)]}
icon.bsim.results.status.matches = flag_green.png
icon.bsim.results.status.ignored = checkmark_yellow.gif
icon.bsim.functions.table = FunctionScope.gif
[Dark Defaults]

@ -0,0 +1,13 @@
<dbconfig>
<info>
<name>Large 32-bit</name>
<owner>Example Owner</owner>
<description>A large (~100 million functions) database tuned for 32-bit executables</description>
<major>0</major>
<minor>0</minor>
<settings>0x49</settings>
</info>
<k>19</k>
<L>232</L>
<weightsfile>lshweights_32.xml</weightsfile>
</dbconfig>
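
The `<dbconfig>` templates in this commit all carry the same fields: a name, a hexadecimal `settings` mask, the integers `k` and `L`, and a weights file name. A minimal sketch of reading those fields with the JDK's DOM parser follows. `DbConfigSketch` is a hypothetical class for illustration, not the loader BSim itself uses. In conventional locality-sensitive hashing `k` and `L` denote hash bits per table and number of tables; that this reading applies unchanged to BSim is an assumption here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical reader for a <dbconfig> template (illustration only).
final class DbConfigSketch {
    final String name;
    final int settings;
    final int k;
    final int l;
    final String weightsFile;

    private DbConfigSketch(String name, int settings, int k, int l, String weightsFile) {
        this.name = name;
        this.settings = settings;
        this.k = k;
        this.l = l;
        this.weightsFile = weightsFile;
    }

    static DbConfigSketch parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            return new DbConfigSketch(
                text(root, "name"),
                Integer.decode(text(root, "settings")), // accepts the 0x.. form
                Integer.parseInt(text(root, "k")),
                Integer.parseInt(text(root, "L")),      // tag names are case-sensitive
                text(root, "weightsfile"));
        }
        catch (Exception e) {
            throw new IllegalArgumentException("Not a readable dbconfig document", e);
        }
    }

    // Returns the trimmed text of the first descendant element with this tag.
    private static String text(Element root, String tag) {
        return root.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

Parsing the "Large 32-bit" template this way yields `k = 19`, `L = 232`, and `settings = 0x49`, which makes it easy to compare templates side by side.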

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

@ -0,0 +1,13 @@
<dbconfig>
<info>
<name>Medium 32-bit</name>
<owner>Example Owner</owner>
<description>A medium sized (~10 million functions) database tuned for 32-bit executables</description>
<major>0</major>
<minor>0</minor>
<settings>0x49</settings>
</info>
<k>17</k>
<L>146</L>
<weightsfile>lshweights_32.xml</weightsfile>
</dbconfig>

@ -0,0 +1,13 @@
<dbconfig>
<info>
<name>Medium 64-bit</name>
<owner>Example Owner</owner>
<description>A medium sized (~10 million functions) database tuned for 64-bit executables</description>
<major>0</major>
<minor>0</minor>
<settings>0x49</settings>
</info>
<k>17</k>
<L>146</L>
<weightsfile>lshweights_64.xml</weightsfile>
</dbconfig>

@ -0,0 +1,13 @@
<dbconfig>
<info>
<name>Medium JVM/Dalvik</name>
<owner>Example Owner</owner>
<description>A medium sized (~10 million functions) database tuned for java .class or .dex files</description>
<major>0</major>
<minor>0</minor>
<settings>0x49</settings>
</info>
<k>17</k>
<L>146</L>
<weightsfile>lshweights_cpool.xml</weightsfile>
</dbconfig>

@ -0,0 +1,13 @@
<dbconfig>
<info>
<name>Medium No Size</name>
<owner>Example Owner</owner>
<description>A medium sized (~10 million functions) database tuned for executables with different address/register sizes</description>
<major>0</major>
<minor>0</minor>
<settings>0x4d</settings>
</info>
<k>17</k>
<L>146</L>
<weightsfile>lshweights_nosize.xml</weightsfile>
</dbconfig>

@ -0,0 +1,14 @@
<serverconfig> <!-- Runtime parameters for the query server -->
<config key="shared_buffers">2GB</config> <!-- Amount of memory the server will use -->
<config key="work_mem">16MB</config> <!-- Max memory to use for hash tables and sorts -->
<config key="checkpoint_timeout">30min</config> <!-- Amount of time before all database records are flushed to disk -->
<config key="listen_addresses">'*'</config> <!-- '*' = all available, '0.0.0.0' just IPv4, 'localhost' -->
<config key="ssl">on</config> <!-- Allow clients to connect to the server via SSL -->
<!-- <config key="ssl_ciphers">TLSv1.2</config> -->
<config key="password_encryption">scram-sha-256</config>
<!-- <connect db="all" user="all" type="local" method="trust"/> -->
<connect db="all" user="all" addr="127.0.0.1/32" type="hostssl" method="trust"/>
<connect db="all" user="all" addr="::1/128" type="hostssl" method="trust"/>
<connect db="all" user="all" addr="all" type="hostssl" method="trust"/>
</serverconfig>
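
The `<config>` keys above (`shared_buffers`, `work_mem`, and so on) are PostgreSQL server parameters, and the `<connect>` attributes match the columns of a `pg_hba.conf` record. A minimal sketch of flattening the file into those two forms follows. How BSim actually applies `serverconfig.xml` is not shown in this diff, so the rendering is an assumption, and `ServerConfigSketch` is a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for a <serverconfig> document (illustration only).
final class ServerConfigSketch {

    // Returns key -> value for every <config key="..."> element, in file order.
    // Entries inside XML comments are not elements and are skipped.
    static Map<String, String> readSettings(String xml) {
        NodeList nodes = parse(xml).getElementsByTagName("config");
        Map<String, String> settings = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            settings.put(e.getAttribute("key"), e.getTextContent().trim());
        }
        return settings;
    }

    // Renders each <connect> element as "type db user [addr] method".
    static List<String> readConnectLines(String xml) {
        NodeList nodes = parse(xml).getElementsByTagName("connect");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            List<String> fields = new ArrayList<>();
            for (String attr : new String[] { "type", "db", "user", "addr", "method" }) {
                String value = e.getAttribute(attr); // empty string when absent
                if (!value.isEmpty()) {
                    fields.add(value);
                }
            }
            lines.add(String.join(" ", fields));
        }
        return lines;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        }
        catch (Exception e) {
            throw new IllegalArgumentException("Not a readable serverconfig document", e);
        }
    }
}
```

Listing the connect records this way makes the access policy easier to review; for instance, the last record in the file above renders as `hostssl all all all trust`.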

@ -0,0 +1,175 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
//Generate BSim signatures for the current program. The URL for the program is
//created from the local storage location. These signatures are intended for the
//in-memory database backend.
//@category BSim
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Iterator;
import generic.lsh.vector.LSHVectorFactory;
import ghidra.app.script.GhidraScript;
import ghidra.features.base.values.GhidraValuesMap;
import ghidra.features.bsim.query.*;
import ghidra.features.bsim.query.BSimServerInfo.DBType;
import ghidra.features.bsim.query.FunctionDatabase.Error;
import ghidra.features.bsim.query.FunctionDatabase.ErrorCategory;
import ghidra.features.bsim.query.description.DatabaseInformation;
import ghidra.features.bsim.query.description.DescriptionManager;
import ghidra.features.bsim.query.file.BSimH2FileDBConnectionManager;
import ghidra.features.bsim.query.file.BSimH2FileDBConnectionManager.BSimH2FileDataSource;
import ghidra.features.bsim.query.protocol.*;
import ghidra.framework.model.DomainFolder;
import ghidra.framework.protocol.ghidra.GhidraURL;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.FunctionManager;
import ghidra.util.MessageType;
import ghidra.util.Msg;
/**
 * Generates and commits the BSim signatures for the currentProgram to the
 * selected H2 BSim database.
 */
public class AddProgramToH2BSimDatabaseScript extends GhidraScript {
private static final String DATABASE = "H2 Database";
@Override
protected void run() throws Exception {
if (isRunningHeadless()) {
popup("Use the \"bsim\" command-line tool to add programs to a database headlessly");
return;
}
if (currentProgram == null) {
popup("This script requires that a program be open in the tool");
return;
}
GhidraValuesMap values = new GhidraValuesMap();
values.defineFile(DATABASE, null, new File(System.getProperty("user.home")));
values.setValidator((valueMap, status) -> {
File selected = valueMap.getFile(DATABASE);
if (selected == null || selected.isDirectory() ||
!selected.getAbsolutePath().endsWith(BSimServerInfo.H2_FILE_EXTENSION)) {
status.setStatusText("Invalid Database File!", MessageType.ERROR);
return false;
}
return true;
});
askValues("Select Database File", null, values);
File h2DbFile = values.getFile(DATABASE);
FunctionDatabase h2Database = null;
try {
BSimServerInfo serverInfo =
new BSimServerInfo(DBType.file, null, 0, h2DbFile.getAbsolutePath());
h2Database = BSimClientFactory.buildClient(serverInfo, false);
BSimH2FileDataSource bds =
BSimH2FileDBConnectionManager.getDataSourceIfExists(h2Database.getServerInfo());
if (bds == null) {
popup(h2DbFile.getAbsolutePath() + " is not an H2 database file");
return;
}
if (bds.getActiveConnections() > 0) {
popup("There is an existing connection to the database.");
return;
}
h2Database.initialize();
DatabaseInformation dbInfo = h2Database.getInfo();
LSHVectorFactory vectorFactory = h2Database.getLSHVectorFactory();
GenSignatures gensig = null;
try {
gensig = new GenSignatures(dbInfo.trackcallgraph);
gensig.setVectorFactory(vectorFactory);
gensig.addExecutableCategories(dbInfo.execats);
gensig.addFunctionTags(dbInfo.functionTags);
gensig.addDateColumnName(dbInfo.dateColumnName);
DomainFolder df = currentProgram.getDomainFile().getParent();
URL folderURL = df.getSharedProjectURL();
if (folderURL == null) {
folderURL = df.getLocalProjectURL();
}
String path = GhidraURL.getProjectPathname(folderURL);
URL normalizedProjectURL = GhidraURL.getProjectURL(folderURL);
String repo = normalizedProjectURL.toExternalForm();
gensig.openProgram(this.currentProgram, null, null, null, repo, path);
final FunctionManager fman = currentProgram.getFunctionManager();
final Iterator<Function> iter = fman.getFunctions(true);
gensig.scanFunctions(iter, fman.getFunctionCount(), monitor);
final DescriptionManager manager = gensig.getDescriptionManager();
//need to call sortCallGraph on each FunctionDescription
//this de-dupes the list of callees for each function
//without this there can be SQL errors due to inserting duplicate
//entries into the callgraph table
manager.listAllFunctions().forEachRemaining(fd -> fd.sortCallgraph());
InsertRequest insertreq = new InsertRequest();
insertreq.manage = manager;
if (insertreq.execute(h2Database) == null) {
Error lastError = h2Database.getLastError();
if ((lastError.category == ErrorCategory.Format) ||
(lastError.category == ErrorCategory.Nonfatal)) {
Msg.showWarn(this, null, "Skipping Insert",
currentProgram.getName() + ": " + lastError.message);
return;
}
throw new IOException(currentProgram.getName() + ": " + lastError.message);
}
StringBuilder status = new StringBuilder(currentProgram.getName());
status.append(" added to database ");
status.append(dbInfo.databasename);
status.append("\n\n");
QueryExeCount exeCount = new QueryExeCount();
ResponseExe countResponse = exeCount.execute(h2Database);
if (countResponse != null) {
status.append(dbInfo.databasename);
status.append(" contains ");
status.append(countResponse.recordCount);
status.append(" executables.");
}
else {
status.append("null response from QueryExeCount");
}
popup(status.toString());
}
finally {
if (gensig != null) {
gensig.dispose();
}
}
}
finally {
if (h2Database != null) {
h2Database.close();
BSimH2FileDataSource bds =
BSimH2FileDBConnectionManager.getDataSourceIfExists(h2Database.getServerInfo());
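// bds is null when the selected file was not an H2 database (see the early return above)
if (bds != null) {
bds.dispose();
}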
}
}
}
}

@ -0,0 +1,80 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Calculate similarity/significance scores between executables by
// combining their function scores.
//@category BSim
import java.net.URL;
import ghidra.app.script.GhidraScript;
import ghidra.features.bsim.query.BSimClientFactory;
import ghidra.features.bsim.query.FunctionDatabase;
import ghidra.features.bsim.query.client.*;
import ghidra.features.bsim.query.description.ExecutableRecord;
public class CompareExecutables extends GhidraScript {
private ExecutableComparison exeCompare;
@Override
protected void run() throws Exception {
URL url = BSimClientFactory.deriveBSimURL("ghidra://localhost/repo");
try (FunctionDatabase database = BSimClientFactory.buildClient(url, true)) {
// FileScoreCaching cache = new FileScoreCaching("/tmp/test_scorecacher.txt");
TableScoreCaching cache = new TableScoreCaching(database);
exeCompare =
new ExecutableComparison(database, 1000000, "11111111111111111111111111111111",
cache,
monitor);
// Specify the list of executables to compare by giving their md5 hash
// exeCompare.addExecutable("22222222222222222222222222222222"); // 32 hex-digit string
// exeCompare.addExecutable("33333333333333333333333333333333");
exeCompare.addAllExecutables(5000);
ExecutableScorer scorer = exeCompare.getScorer();
if (!exeCompare.isConfigured()) {
exeCompare.resetThresholds(0.7, 10.0);
}
exeCompare.fillinSelfScores(); // Prefetch self-scores, calculate any we are missing
exeCompare.performScoring();
scorer.commitSelfScore(); // Commit the newly calculated self-score
println("Maximum cluster size = " + Integer.toString(exeCompare.getMaxHitCount()));
println("Hit count exceeded = " + Integer.toString(exeCompare.getExceedCount()));
float scoreThresh = 0.01f;
int numExe = scorer.numExecutables();
ExecutableRecord exeA = scorer.getSingularExecutable();
float selfScoreA = scorer.getSingularSelfScore();
for (int i = 1; i <= numExe; ++i) {
ExecutableRecord exeB = scorer.getExecutable(i);
float selfScoreB = scorer.getScore(i);
if (selfScoreB == 0.0f) { // This is possible if the executable has no "rare" functions.
continue; // as defined by the ExecutableComparison.hitCountThreshold
}
ExecutableRecord smallRecord = selfScoreA < selfScoreB ? exeA : exeB;
ExecutableRecord bigRecord = selfScoreA < selfScoreB ? exeB : exeA;
float libScore = scorer.getNormalizedScore(i, true);
float totalScore = scorer.getNormalizedScore(i, false);
if (libScore < scoreThresh) {
continue;
}
println(smallRecord.getNameExec() + " " + bigRecord.getNameExec());
println(" " + Float.toString(libScore) + " library score");
println(" " + Float.toString(totalScore) + " total score");
}
}
}
}

@ -0,0 +1,148 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Use the decompiler to generate a signature for the current function containing the cursor
// If we remember the last signature that was generated, compare this signature with
// the last signature and print the similarity
//@category BSim
import java.io.*;
import org.xml.sax.SAXException;
import generic.jar.ResourceFile;
import generic.lsh.vector.*;
import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileOptions;
import ghidra.app.decompiler.signature.SignatureResult;
import ghidra.app.script.GhidraScript;
import ghidra.app.services.ProgramManager;
import ghidra.features.bsim.query.GenSignatures;
import ghidra.program.model.address.Address;
import ghidra.program.model.lang.LanguageID;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.Program;
import ghidra.util.xml.SpecXmlUtils;
import ghidra.xml.NonThreadedXmlPullParserImpl;
import ghidra.xml.XmlPullParser;
public class CompareSignatures extends GhidraScript {
private LSHVectorFactory vectorFactory;
private LSHVector generateVector(Function f, Program program) {
DecompInterface decompiler = new DecompInterface();
decompiler.setOptions(new DecompileOptions());
decompiler.toggleSyntaxTree(false);
decompiler.setSignatureSettings(vectorFactory.getSettings());
if (!decompiler.openProgram(program)) {
println("Unable to initialize the Decompiler interface");
println(decompiler.getLastMessage());
return null;
}
SignatureResult sigres = decompiler.generateSignatures(f, false, 10, null);
LSHVector vec = vectorFactory.buildVector(sigres.features);
return vec;
}
private Program getProgram(Program[] progarray, String name) {
if ((name == null) || (progarray == null)) {
return null;
}
for (Program prog : progarray) {
if (name.equals(prog.getName())) {
return prog;
}
}
return null;
}
private static void readWeights(LSHVectorFactory vectorFactory, ResourceFile weightsFile)
throws FileNotFoundException, IOException, SAXException {
// try-with-resources closes the stream even if parsing throws
try (InputStream input = weightsFile.getInputStream()) {
XmlPullParser parser = new NonThreadedXmlPullParserImpl(input, "Vector weights parser",
SpecXmlUtils.getXmlHandler(), false);
vectorFactory.readWeights(parser);
}
}
private void buildLSHVectorFactory() {
vectorFactory = new WeightedLSHCosineVectorFactory();
try {
LanguageID id = currentProgram.getLanguageID();
ResourceFile defaultWeightsFile = GenSignatures.getWeightsFile(id, id);
readWeights(vectorFactory, defaultWeightsFile);
}
catch (IOException | SAXException e) {
// FileNotFoundException is a subclass of IOException and is handled here as well
e.printStackTrace();
}
}
@Override
protected void run() throws Exception {
Function func = this.getFunctionContaining(this.currentAddress);
if (func == null) {
return;
}
buildLSHVectorFactory();
LSHVector vec = generateVector(func, currentProgram);
ProgramManager programManager = state.getTool().getService(ProgramManager.class);
Program[] progarray = programManager.getAllOpenPrograms();
String lastprogram_string = System.getProperty("ghidra.lastprogram");
Program lastprogram = getProgram(progarray, lastprogram_string);
VectorCompare veccompare = new VectorCompare();
if (lastprogram != null) {
String addrstring = System.getProperty("ghidra.lastaddress");
if (addrstring != null) {
Address addr = lastprogram.getAddressFactory().getAddress(addrstring);
Function lastfunction = lastprogram.getFunctionManager().getFunctionAt(addr);
if (lastfunction != null) {
LSHVector lastvector = generateVector(lastfunction, lastprogram);
double sim = lastvector.compare(vec, veccompare);
double signif = vectorFactory.calculateSignificance(veccompare);
StringBuilder buf = new StringBuilder();
buf.append("Comparison results:\n");
buf.append(lastprogram.getName());
buf.append(".");
buf.append(lastfunction.getName());
buf.append(" vs. ");
buf.append(currentProgram.getName());
buf.append(".");
buf.append(func.getName());
buf.append("\n Similarity: ");
buf.append(Double.toString(sim));
buf.append("\n Significance: ");
buf.append(Double.toString(signif));
buf.append("\n");
lastvector.compareDetail(vec, buf);
println(buf.toString());
}
}
}
System.setProperty("ghidra.lastprogram", currentProgram.getName());
String addrstring = func.getEntryPoint().toString();
System.setProperty("ghidra.lastaddress", addrstring);
}
}
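
The script above builds its vectors with `WeightedLSHCosineVectorFactory` and reports a similarity between two functions. As a rough illustration of what a weighted cosine similarity over hashed features looks like, here is a minimal sketch. It is not Ghidra's `LSHVector.compare` implementation, whose details are not shown in this diff; the class `CosineSketch` and the representation of a vector as a map from feature hash to weight are assumptions made for the example.

```java
import java.util.Map;

// Hypothetical illustration of cosine similarity between two sparse,
// weighted feature vectors (feature hash -> weight).
final class CosineSketch {

    static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        // Only features present in both vectors contribute to the dot product.
        for (Map.Entry<Integer, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (normA * normB);
    }

    // Euclidean length of the weight vector.
    static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double weight : v.values()) {
            sum += weight * weight;
        }
        return Math.sqrt(sum);
    }
}
```

Under this definition identical vectors score 1, vectors with no shared features score 0, and heavily weighted features dominate the result, which is the reason a weights file matters for the comparison.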

@ -0,0 +1,155 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Compare the BSim feature vectors of two functions.
//@category BSim
import java.io.*;
import org.xml.sax.SAXException;
import generic.jar.ResourceFile;
import generic.lsh.vector.*;
import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileOptions;
import ghidra.app.decompiler.signature.SignatureResult;
import ghidra.app.script.GhidraScript;
import ghidra.app.services.ProgramManager;
import ghidra.framework.Application;
import ghidra.program.model.address.Address;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.Program;
import ghidra.util.exception.CancelledException;
import ghidra.util.xml.SpecXmlUtils;
import ghidra.xml.NonThreadedXmlPullParserImpl;
import ghidra.xml.XmlPullParser;
public class CompareSignaturesSpecifyWeights extends GhidraScript {
private static final String DEFAULT_LSH_WEIGHTS_FILE = "lshweights_nosize.xml";
private LSHVectorFactory vectorFactory;
private LSHVector generateVector(Function f, Program program) {
DecompInterface decompiler = new DecompInterface();
decompiler.setOptions(new DecompileOptions());
decompiler.setSignatureSettings(vectorFactory.getSettings());
decompiler.toggleSyntaxTree(false);
if (!decompiler.openProgram(program)) {
println("Unable to initialize the Decompiler interface");
println(decompiler.getLastMessage());
return null;
}
SignatureResult sigres = decompiler.generateSignatures(f, false, 10, null);
LSHVector vec = vectorFactory.buildVector(sigres.features);
return vec;
}
private static void readWeights(LSHVectorFactory vectorFactory, ResourceFile weightsFile)
throws FileNotFoundException, IOException, SAXException {
// try-with-resources closes the stream even if parsing throws
try (InputStream input = weightsFile.getInputStream()) {
XmlPullParser parser = new NonThreadedXmlPullParserImpl(input, "Vector weights parser",
SpecXmlUtils.getXmlHandler(), false);
vectorFactory.readWeights(parser);
}
}
private boolean buildLSHVectorFactory() {
vectorFactory = new WeightedLSHCosineVectorFactory();
try {
String weightsFile =
askString("Enter weights file name", "weights file", DEFAULT_LSH_WEIGHTS_FILE);
ResourceFile defaultWeightsFile = Application.findDataFileInAnyModule(weightsFile);
readWeights(vectorFactory, defaultWeightsFile);
}
catch (IOException | SAXException e) {
// FileNotFoundException is a subclass of IOException and is handled here as well
e.printStackTrace();
return false;
}
catch (CancelledException e) {
return false;
}
return true;
}
private Program getProgram(Program[] progarray, String name) {
if ((name == null) || (progarray == null)) {
return null;
}
for (Program prog : progarray) {
if (name.equals(prog.getName())) {
return prog;
}
}
return null;
}
@Override
protected void run() throws Exception {
Function func = this.getFunctionContaining(this.currentAddress);
if (func == null) {
return;
}
if (!buildLSHVectorFactory()) {
return;
}
LSHVector vec = generateVector(func, currentProgram);
ProgramManager programManager = state.getTool().getService(ProgramManager.class);
Program[] progarray = programManager.getAllOpenPrograms();
String lastprogram_string = System.getProperty("ghidra.lastprogram");
Program lastprogram = getProgram(progarray, lastprogram_string);
VectorCompare veccompare = new VectorCompare();
if (lastprogram != null) {
String addrstring = System.getProperty("ghidra.lastaddress");
if (addrstring != null) {
Address addr = lastprogram.getAddressFactory().getAddress(addrstring);
Function lastfunction = lastprogram.getFunctionManager().getFunctionAt(addr);
if (lastfunction != null) {
LSHVector lastvector = generateVector(lastfunction, lastprogram);
double sim = lastvector.compare(vec, veccompare);
double signif = vectorFactory.calculateSignificance(veccompare);
StringBuilder buf = new StringBuilder();
buf.append("Comparison results:\n");
buf.append(lastprogram.getName());
buf.append(".");
buf.append(lastfunction.getName());
buf.append(" vs. ");
buf.append(currentProgram.getName());
buf.append(".");
buf.append(func.getName());
buf.append("\n Similarity: ");
buf.append(Double.toString(sim));
buf.append("\n Significance: ");
buf.append(Double.toString(signif));
buf.append("\n");
lastvector.compareDetail(vec, buf);
println(buf.toString());
}
}
}
System.setProperty("ghidra.lastprogram", currentProgram.getName());
String addrstring = func.getEntryPoint().toString();
System.setProperty("ghidra.lastaddress", addrstring);
}
}

@ -0,0 +1,170 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
//Creates an empty file-based H2 BSim database
//@category BSim
import java.io.File;
import java.io.IOException;
import java.util.*;
import org.apache.commons.lang3.StringUtils;
import ghidra.app.script.GhidraScript;
import ghidra.features.base.values.GhidraValuesMap;
import ghidra.features.bsim.query.*;
import ghidra.features.bsim.query.BSimServerInfo.DBType;
import ghidra.features.bsim.query.FunctionDatabase.Error;
import ghidra.features.bsim.query.description.DatabaseInformation;
import ghidra.features.bsim.query.file.BSimH2FileDBConnectionManager;
import ghidra.features.bsim.query.file.BSimH2FileDBConnectionManager.BSimH2FileDataSource;
import ghidra.features.bsim.query.protocol.*;
import ghidra.util.MessageType;
import ghidra.util.Msg;
public class CreateH2BSimDatabaseScript extends GhidraScript {
private static final String NAME = "Database Name";
private static final String DIRECTORY = "Database Directory";
private static final String DATABASE_TEMPLATE = "Database Template";
private static final String FUNCTION_TAGS = "Function Tags (CSV)";
private static final String EXECUTABLE_CATEGORIES = "Executable Categories (CSV)";
private static final String[] templates =
{ "medium_nosize", "medium_32", "medium_64", "medium_cpool" };
@Override
protected void run() throws Exception {
if (isRunningHeadless()) {
popup("Use \"bsim\" to create an H2 BSim database from the command line");
return;
}
GhidraValuesMap values = new GhidraValuesMap();
values.defineString(NAME, "");
values.defineDirectory(DIRECTORY, new File(System.getProperty("user.home")));
values.defineChoice(DATABASE_TEMPLATE, "medium_nosize", templates);
values.defineString(FUNCTION_TAGS);
values.defineString(EXECUTABLE_CATEGORIES);
values.setValidator((valueMap, status) -> {
String databaseName = valueMap.getString(NAME);
if (StringUtils.isBlank(databaseName)) {
status.setStatusText("Name must be filled in!", MessageType.ERROR);
return false;
}
File directory = valueMap.getFile(DIRECTORY);
if (!directory.isDirectory()) {
status.setStatusText("Invalid directory!", MessageType.ERROR);
return false;
}
File dbFile = new File(directory, databaseName);
File testFile = new File(dbFile.getPath() + BSimServerInfo.H2_FILE_EXTENSION);
if (testFile.exists()) {
status.setStatusText("Database file already exists!", MessageType.ERROR);
return false;
}
return true;
});
askValues("Enter Database Parameters",
"Enter values required to create a new BSim H2 database.", values);
FunctionDatabase h2Database = null;
try {
String databaseName = values.getString(NAME);
File dbDir = values.getFile(DIRECTORY);
String template = values.getChoice(DATABASE_TEMPLATE);
String functionTagsCSV = values.getString(FUNCTION_TAGS);
List<String> tags = parseCSV(functionTagsCSV);
String exeCatCSV = values.getString(EXECUTABLE_CATEGORIES);
List<String> cats = parseCSV(exeCatCSV);
File dbFile = new File(dbDir, databaseName);
BSimServerInfo serverInfo =
new BSimServerInfo(DBType.file, null, 0, dbFile.getAbsolutePath());
h2Database = BSimClientFactory.buildClient(serverInfo, false);
BSimH2FileDataSource bds =
BSimH2FileDBConnectionManager.getDataSourceIfExists(h2Database.getServerInfo());
if (bds != null && bds.getActiveConnections() > 0) {
//if this happens, there is a connection to the database but the
//database file was deleted
Msg.showError(this, null, "Connection Error",
"There is an existing connection to the database!");
return;
}
CreateDatabase command = new CreateDatabase();
command.info = new DatabaseInformation();
// Fill in the fields the user provided in the dialog
// Anything left unset is filled in from the template
command.info.databasename = databaseName;
command.config_template = template;
command.info.trackcallgraph = true;
ResponseInfo response = command.execute(h2Database);
if (response == null) {
throw new IOException(h2Database.getLastError().message);
}
for (String tag : tags) {
InstallTagRequest req = new InstallTagRequest();
req.tag_name = tag;
ResponseInfo resp = req.execute(h2Database);
if (resp == null) {
Error lastError = h2Database.getLastError();
throw new LSHException(lastError.message);
}
}
for (String cat : cats) {
InstallCategoryRequest req = new InstallCategoryRequest();
req.type_name = cat;
ResponseInfo resp = req.execute(h2Database);
if (resp == null) {
Error lastError = h2Database.getLastError();
throw new LSHException(lastError.message);
}
}
popup("Database " + values.getString(NAME) + " created successfully!");
}
finally {
if (h2Database != null) {
h2Database.close();
BSimH2FileDataSource bds =
BSimH2FileDBConnectionManager.getDataSourceIfExists(h2Database.getServerInfo());
if (bds != null) {
bds.dispose();
}
}
}
}
//splits a comma-separated string, dropping blank entries and duplicates;
//the returned list is sorted
private List<String> parseCSV(String csv) {
Set<String> parsed = new HashSet<>();
if (StringUtils.isEmpty(csv)) {
return new ArrayList<String>();
}
String[] parts = csv.split(",");
for (String p : parts) {
if (!StringUtils.isBlank(p)) {
parsed.add(p.trim());
}
}
List<String> res = new ArrayList<>(parsed);
res.sort(String::compareTo);
return res;
}
}
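
The CSV handling in parseCSV above can be tried outside Ghidra. The following is a minimal sketch, assuming plain JDK classes stand in for Apache Commons StringUtils; the class name CsvFieldParser is hypothetical and not part of the Ghidra API, and String.trim() treats whitespace slightly differently than StringUtils.isBlank does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical standalone helper for illustration; not part of the Ghidra API.
class CsvFieldParser {

    // Split a comma-separated string, trim each entry, drop blank entries,
    // remove duplicates, and return the survivors in natural String order.
    static List<String> parse(String csv) {
        if (csv == null || csv.isEmpty()) {
            return new ArrayList<>();
        }
        // A TreeSet de-duplicates and sorts in one step
        TreeSet<String> unique = new TreeSet<>();
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                unique.add(trimmed);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Duplicate and blank entries are discarded
        System.out.println(parse(" zlib, crypto ,,zlib")); // prints [crypto, zlib]
    }
}
```

Using a sorted set keeps the tag and category lists in a stable order regardless of how the user typed them, which is the same effect the script gets from HashSet followed by sort.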

View File

@ -0,0 +1,72 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.util.List;
import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileOptions;
import ghidra.app.decompiler.signature.DebugSignature;
import ghidra.app.script.GhidraScript;
import ghidra.program.model.lang.Language;
import ghidra.program.model.listing.Function;
public class DebugSignatures extends GhidraScript {
private static final int SIGNATURE_SETTINGS = 0x45;
@Override
protected void run() throws Exception {
Function func = this.getFunctionContaining(this.currentAddress);
if (func == null) {
popup("No function selected!");
return;
}
DecompInterface decompiler = new DecompInterface();
decompiler.setOptions(new DecompileOptions());
decompiler.toggleSyntaxTree(false);
decompiler.setSignatureSettings(SIGNATURE_SETTINGS);
if (!decompiler.openProgram(this.currentProgram)) {
println("Unable to initialize the Decompiler interface");
println(decompiler.getLastMessage());
return;
}
Language language = this.currentProgram.getLanguage();
List<DebugSignature> sigres = decompiler.debugSignatures(func, 10, null);
StringBuffer buf = new StringBuffer();
buf.append("\nFunction: ");
buf.append(func.getName());
buf.append("\nentry: ");
buf.append(func.getEntryPoint().toString());
buf.append("\n\n");
if (sigres == null) {
println("No debug signatures were returned for this function");
}
else {
for (int i = 0; i < sigres.size(); ++i) {
sigres.get(i).printRaw(language, buf);
buf.append("\n");
}
}
printf("%s\n", buf.toString());
decompiler.closeProgram();
decompiler.dispose();
}
}

View File

@ -0,0 +1,61 @@
## ###
# IP: GHIDRA
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##
# Use the decompiler to generate signatures for the function at the current address, then dump the
# signature hashes and debug information to the console
# @category: BSim.python
import java.lang.StringBuffer as StringBuffer
import ghidra.app.decompiler.DecompInterface as DecompInterface
import ghidra.app.decompiler.DecompileOptions as DecompileOptions
import generic.lsh.vector.WeightedLSHCosineVectorFactory as WeightedLSHCosineVectorFactory
import ghidra.features.bsim.query.GenSignatures as GenSignatures
import ghidra.xml.NonThreadedXmlPullParserImpl as NonThreadedXmlPullParserImpl
import ghidra.util.xml.SpecXmlUtils as SpecXmlUtils

def processFunction(func):
    decompiler = DecompInterface()
    options = DecompileOptions()
    decompiler.setOptions(options)
    decompiler.toggleSyntaxTree(False)
    decompiler.setSignatureSettings(getSettings())
    if not decompiler.openProgram(currentProgram):
        print "Unable to initialize the Decompiler interface!"
        print "%s" % decompiler.getLastMessage()
        return
    language = currentProgram.getLanguage()
    sigres = decompiler.debugSignatures(func, 10, None)
    # print the raw debug form of each signature feature
    for res in sigres:
        buf = StringBuffer()
        res.printRaw(language, buf)
        print "%s" % buf.toString()
    decompiler.closeProgram()
    decompiler.dispose()

def getSettings():
    # read the default weights for the current language to obtain the signature settings
    vectorFactory = WeightedLSHCosineVectorFactory()
    langId = currentProgram.getLanguageID()
    defaultWeightsFile = GenSignatures.getWeightsFile(langId, langId)
    inputStream = defaultWeightsFile.getInputStream()
    try:
        parser = NonThreadedXmlPullParserImpl(inputStream, "Vector weights parser", SpecXmlUtils.getXmlHandler(), False)
        vectorFactory.readWeights(parser)
    finally:
        inputStream.close()
    return vectorFactory.getSettings()

func = currentProgram.getFunctionManager().getFunctionContaining(currentAddress)
if func is None:
    print "no function at current address"
else:
    processFunction(func)

View File

@ -0,0 +1,115 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Use the decompiler to generate signatures for the function currently containing the cursor
// and dump the signature hashes to the console
//@category BSim
import java.io.*;
import java.util.List;
import org.xml.sax.SAXException;
import generic.jar.ResourceFile;
import generic.lsh.vector.LSHVectorFactory;
import generic.lsh.vector.WeightedLSHCosineVectorFactory;
import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileOptions;
import ghidra.app.decompiler.signature.DebugSignature;
import ghidra.app.decompiler.signature.SignatureResult;
import ghidra.app.script.GhidraScript;
import ghidra.features.bsim.query.GenSignatures;
import ghidra.program.model.lang.Language;
import ghidra.program.model.lang.LanguageID;
import ghidra.program.model.listing.Function;
import ghidra.util.xml.SpecXmlUtils;
import ghidra.xml.NonThreadedXmlPullParserImpl;
import ghidra.xml.XmlPullParser;
public class DumpSignatures extends GhidraScript {
private LSHVectorFactory vectorFactory;
@Override
public void run() throws Exception {
Function func = this.getFunctionContaining(this.currentAddress);
if (func == null) {
println("No function at the current address");
return;
}
buildLSHVectorFactory();
boolean debug = false;
DecompInterface decompiler = new DecompInterface();
decompiler.setOptions(new DecompileOptions());
decompiler.setSignatureSettings(vectorFactory.getSettings());
decompiler.toggleSyntaxTree(false);
if (!decompiler.openProgram(this.currentProgram)) {
println("Unable to initialize the Decompiler interface");
println(decompiler.getLastMessage());
return;
}
if (!debug) {
SignatureResult sigres = decompiler.generateSignatures(func, false, 10, null);
StringBuffer buf = new StringBuffer("\n");
for (int feature : sigres.features) {
buf.append(Integer.toHexString(feature));
buf.append("\n");
}
println(buf.toString());
}
else {
Language language = this.currentProgram.getLanguage();
List<DebugSignature> sigres = decompiler.debugSignatures(func, 10, null);
StringBuffer buf = new StringBuffer("\n");
for (int i = 0; i < sigres.size(); ++i) {
sigres.get(i).printRaw(language, buf);
buf.append("\n");
}
println(buf.toString());
}
decompiler.closeProgram();
decompiler.dispose();
}
private static void readWeights(LSHVectorFactory vectorFactory, ResourceFile weightsFile)
throws FileNotFoundException, IOException, SAXException {
// try-with-resources closes the stream even if parsing fails
try (InputStream input = weightsFile.getInputStream()) {
XmlPullParser parser = new NonThreadedXmlPullParserImpl(input, "Vector weights parser",
SpecXmlUtils.getXmlHandler(), false);
vectorFactory.readWeights(parser);
}
}
private void buildLSHVectorFactory() {
vectorFactory = new WeightedLSHCosineVectorFactory();
try {
LanguageID id = currentProgram.getLanguageID();
ResourceFile defaultWeightsFile = GenSignatures.getWeightsFile(id, id);
readWeights(vectorFactory, defaultWeightsFile);
}
catch (IOException | SAXException e) {
// FileNotFoundException is a subclass of IOException, so it is covered here
printerr("Unable to read signature weights: " + e.getMessage());
}
}
}
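
The non-debug branch of DumpSignatures prints each feature with Integer.toHexString. The following is a minimal sketch of that formatting step on its own, assuming a plain int array in place of SignatureResult.features; the class name FeatureHexDump and the sample values are hypothetical. It shows that negative feature hashes come out as unsigned 32-bit hex and that no zero padding is applied.

```java
// Hypothetical standalone illustration; not part of the Ghidra API.
class FeatureHexDump {

    // Render each 32-bit feature hash as unsigned hex, one value per line.
    static String dump(int[] features) {
        StringBuilder buf = new StringBuilder();
        for (int feature : features) {
            // toHexString interprets the int as unsigned and does not pad with zeros
            buf.append(Integer.toHexString(feature)).append('\n');
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Sample values chosen for illustration only
        System.out.print(dump(new int[] { 0x1f, -1, 0x7fffffff }));
    }
}
```

If fixed-width output is preferred for comparing dumps side by side, String.format("%08x", feature) is one alternative.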

View File

@ -0,0 +1,61 @@
## ###
# IP: GHIDRA
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##
# Use the decompiler to generate signatures for the function at the current address, then dump the
# signature hashes to the console
# @category: BSim.python
import java.lang.Integer as Integer
import java.lang.StringBuffer as StringBuffer
import ghidra.app.decompiler.DecompInterface as DecompInterface
import ghidra.app.decompiler.DecompileOptions as DecompileOptions
import generic.lsh.vector.WeightedLSHCosineVectorFactory as WeightedLSHCosineVectorFactory
import ghidra.features.bsim.query.GenSignatures as GenSignatures
import ghidra.xml.NonThreadedXmlPullParserImpl as NonThreadedXmlPullParserImpl
import ghidra.util.xml.SpecXmlUtils as SpecXmlUtils

def processFunction(func):
    decompiler = DecompInterface()
    options = DecompileOptions()
    decompiler.setOptions(options)
    decompiler.toggleSyntaxTree(False)
    decompiler.setSignatureSettings(getSettings())
    if not decompiler.openProgram(currentProgram):
        print "Unable to initialize the Decompiler interface!"
        print "%s" % decompiler.getLastMessage()
        return
    sigres = decompiler.generateSignatures(func, False, 10, None)
    # print each feature hash in hex, one per line
    buf = StringBuffer()
    for feature in sigres.features:
        buf.append(Integer.toHexString(feature))
        buf.append("\n")
    print buf.toString()
    decompiler.closeProgram()
    decompiler.dispose()

def getSettings():
    # read the default weights for the current language to obtain the signature settings
    vectorFactory = WeightedLSHCosineVectorFactory()
    langId = currentProgram.getLanguageID()
    defaultWeightsFile = GenSignatures.getWeightsFile(langId, langId)
    inputStream = defaultWeightsFile.getInputStream()
    try:
        parser = NonThreadedXmlPullParserImpl(inputStream, "Vector weights parser", SpecXmlUtils.getXmlHandler(), False)
        vectorFactory.readWeights(parser)
    finally:
        inputStream.close()
    return vectorFactory.getSettings()

func = currentProgram.getFunctionManager().getFunctionContaining(currentAddress)
if func is None:
    print "no function at current address"
else:
    processFunction(func)

View File

@ -0,0 +1,69 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
//Example of how to perform an overview query in a script.
//@category BSim
import java.util.HashSet;
import generic.lsh.vector.LSHVectorFactory;
import ghidra.app.script.GhidraScript;
import ghidra.features.bsim.query.facade.SFOverviewInfo;
import ghidra.features.bsim.query.facade.SimilarFunctionQueryService;
import ghidra.features.bsim.query.protocol.ResponseNearestVector;
import ghidra.features.bsim.query.protocol.SimilarityVectorResult;
import ghidra.program.database.symbol.FunctionSymbol;
import ghidra.program.model.listing.*;
public class ExampleOverviewQuery extends GhidraScript {
private static final double SIMILARITY_BOUND = 0.7;
private static final double SIGNIFICANCE_BOUND = 0.0;
@Override
protected void run() throws Exception {
Program queryingProgram = currentProgram;
HashSet<FunctionSymbol> funcsToQuery = new HashSet<>();
FunctionIterator fIter = queryingProgram.getFunctionManager().getFunctionsNoStubs(true);
for (Function func : fIter){
funcsToQuery.add((FunctionSymbol) func.getSymbol());
}
SFOverviewInfo overviewInfo = new SFOverviewInfo(funcsToQuery);
overviewInfo.setSimilarityThreshold(SIMILARITY_BOUND);
overviewInfo.setSignificanceThreshold(SIGNIFICANCE_BOUND);
try (SimilarFunctionQueryService queryService =
new SimilarFunctionQueryService(queryingProgram)) {
String DATABASE_URL = askString("Enter database URL", "URL:");
queryService.initializeDatabase(DATABASE_URL);
LSHVectorFactory vectorFactory = queryService.getLSHVectorFactory();
ResponseNearestVector overviewResults =
queryService.overviewSimilarFunctions(overviewInfo, null, monitor);
StringBuilder buf = new StringBuilder();
buf.append("\n");
for (SimilarityVectorResult result : overviewResults.result) {
buf.append("Name: ").append(result.getBase().getFunctionName()).append("\n");
buf.append("Hit Count: ").append(result.getTotalCount()).append("\n");
buf.append("Self-significance: ");
buf.append(vectorFactory
.getSelfSignificance(result.getBase().getSignatureRecord().getLSHVector()));
buf.append("\n\n");
}
printf("%s\n", buf.toString());
}
}
}

View File

@ -0,0 +1,47 @@
## ###
# IP: GHIDRA
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##
# Example of how to perform an overview query in a script
# @category BSim.python
import ghidra.features.bsim.query.facade.SFOverviewInfo as SFOverviewInfo
import ghidra.features.bsim.query.facade.SimilarFunctionQueryService as SimilarFunctionQueryService
import java.util.HashSet as HashSet

SIMILARITY_BOUND = 0.7
SIGNIFICANCE_BOUND = 0.0

# collect the symbols of all non-stub functions in the current program
funcsToQuery = HashSet()
fIter = currentProgram.getFunctionManager().getFunctionsNoStubs(True)
for func in fIter:
    funcsToQuery.add(func.getSymbol())

overviewInfo = SFOverviewInfo(funcsToQuery)
overviewInfo.setSimilarityThreshold(SIMILARITY_BOUND)
overviewInfo.setSignificanceThreshold(SIGNIFICANCE_BOUND)

queryService = SimilarFunctionQueryService(currentProgram)
try:
    DB_URL = askString("Enter database URL", "URL:")
    queryService.initializeDatabase(DB_URL)
    vectorFactory = queryService.getLSHVectorFactory()
    # same three-argument call as the Java version of this example
    overviewResults = queryService.overviewSimilarFunctions(overviewInfo, None, monitor)
    for result in overviewResults.result:
        print "Name: %s" % result.getBase().getFunctionName()
        print "Hit Count: %d" % result.getTotalCount()
        print "Self-significance: %f\n" % vectorFactory.getSelfSignificance(result.getBase().getSignatureRecord().getLSHVector())
finally:
    # release the database connection even if the query fails
    queryService.dispose()

View File

@ -0,0 +1,83 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Example of connecting to a BSim server and requesting executable and function records
//@category BSim
import java.io.StringWriter;
import java.net.URL;
import java.util.List;
import ghidra.app.script.GhidraScript;
import ghidra.features.bsim.query.BSimClientFactory;
import ghidra.features.bsim.query.FunctionDatabase;
import ghidra.features.bsim.query.description.*;
import ghidra.features.bsim.query.protocol.*;
import ghidra.util.Msg;
public class ExampleQueryClient extends GhidraScript {
@Override
protected void run() throws Exception {
URL url = BSimClientFactory.deriveBSimURL("ghidra://localhost/repo");
try (FunctionDatabase client = BSimClientFactory.buildClient(url, false)) {
if (!client.initialize()) {
Msg.error(this, "Unable to connect to server");
return;
}
QueryInfo query = new QueryInfo();
ResponseInfo resp = query.execute(client);
if (resp == null) {
Msg.error(this, client.getLastError());
return;
}
StringWriter write = new StringWriter();
resp.saveXml(write);
write.flush();
QueryName exequery = new QueryName();
exequery.spec.exename = "libdocdoxygenplugin.so";
ResponseName respname = exequery.execute(client);
if (respname == null) {
Msg.error(this, client.getLastError());
return;
}
ExecutableRecord erec = respname.manage.getExecutableRecordSet().first();
FunctionDescription funcrec =
respname.manage.findFunctionByName("DocDoxygenPlugin::createCatalog", erec);
QueryChildren childquery = new QueryChildren();
childquery.md5sum = funcrec.getExecutableRecord().getMd5();
childquery.functionKeys.add(new FunctionEntry(funcrec));
ResponseChildren respchild = childquery.execute(client);
if (respchild == null) {
Msg.error(this, client.getLastError());
return;
}
for (int i = 0; i < respchild.correspond.size(); ++i) {
FunctionDescription func = respchild.correspond.get(i);
List<CallgraphEntry> callgraphRecord = func.getCallgraphRecord();
if (callgraphRecord != null) {
for (int j = 0; j < callgraphRecord.size(); ++j) {
write.write(
callgraphRecord.get(j).getFunctionDescription().getFunctionName());
write.write('\n');
}
}
}
write.flush();
Msg.info(this, write.toString());
}
}
}

View File

@ -0,0 +1,73 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Generate signatures for every function in the current executable and write in XML form to
// a user specified file.
//@category BSim
import java.io.*;
import java.util.Iterator;
import generic.lsh.vector.LSHVectorFactory;
import ghidra.app.script.GhidraScript;
import ghidra.features.bsim.query.FunctionDatabase;
import ghidra.features.bsim.query.GenSignatures;
import ghidra.features.bsim.query.client.Configuration;
import ghidra.features.bsim.query.description.DescriptionManager;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.FunctionManager;
public class GenerateSignatures extends GhidraScript {
@Override
public void run() throws Exception {
final String md5string = currentProgram.getExecutableMD5();
if ((md5string == null) || (md5string.length() < 10)) {
throw new IOException("Could not get MD5 on file: " + currentProgram.getName());
}
final String basename = "sigs_" + md5string;
System.setProperty("ghidra.output", basename); // Inform parallel controller of output name
File file = null;
// This form of askDirectory works for both standalone and parallel execution
final File workingdir = askDirectory("GenerateSignatures:", "Working directory");
if (!workingdir.isDirectory()) {
popup("Must select a working directory!");
return;
}
file = new File(workingdir, basename);
final LSHVectorFactory vectorFactory = FunctionDatabase.generateLSHVectorFactory();
final GenSignatures gensig = new GenSignatures(true);
final String templatename =
askString("GenerateSignatures:", "Database template", "medium_nosize");
final Configuration config = FunctionDatabase.loadConfigurationTemplate(templatename);
vectorFactory.set(config.weightfactory, config.idflookup, config.info.settings);
gensig.setVectorFactory(vectorFactory);
gensig.addExecutableCategories(config.info.execats);
gensig.addFunctionTags(config.info.functionTags);
gensig.addDateColumnName(config.info.dateColumnName);
final String repo = "ghidra://localhost/" + state.getProject().getName();
final String path = GenSignatures.getPathFromDomainFile(currentProgram);
gensig.openProgram(this.currentProgram, null, null, null, repo, path);
final FunctionManager fman = currentProgram.getFunctionManager();
final Iterator<Function> iter = fman.getFunctions(true);
gensig.scanFunctions(iter, fman.getFunctionCount(), monitor);
final DescriptionManager manager = gensig.getDescriptionManager();
// try-with-resources closes the writer even if saveXml fails
try (FileWriter fwrite = new FileWriter(file)) {
manager.saveXml(fwrite);
}
}
}

View File

@ -0,0 +1,58 @@
## ###
# IP: GHIDRA
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##
#Generate signatures for every function in the current program and write them to an XML file in a user-specified directory
#@category BSim.python
import java.lang.System as System
import java.io.File as File
import java.io.FileWriter as FileWriter
import java.io.IOException as IOException
import ghidra.features.bsim.query.FunctionDatabase as FunctionDatabase
import ghidra.features.bsim.query.GenSignatures as GenSignatures

def run():
    md5String = currentProgram.getExecutableMD5()
    if (md5String is None) or (len(md5String) < 10):
        raise IOException("Could not get MD5 on file: " + currentProgram.getName())
    basename = "sigs_" + md5String
    # inform the parallel controller of the output name
    System.setProperty("ghidra.output", basename)
    workingDir = askDirectory("GenerateSignatures:", "Working Directory")
    if not workingDir.isDirectory():
        popup("Must select a working directory")
        return
    outfile = File(workingDir, basename)
    vectorFactory = FunctionDatabase.generateLSHVectorFactory()
    gensig = GenSignatures(True)
    templateName = askString("GenerateSignatures:", "Database template", "medium_nosize")
    config = FunctionDatabase.loadConfigurationTemplate(templateName)
    vectorFactory.set(config.weightfactory, config.idflookup, config.info.settings)
    gensig.setVectorFactory(vectorFactory)
    gensig.addExecutableCategories(config.info.execats)
    gensig.addFunctionTags(config.info.functionTags)
    gensig.addDateColumnName(config.info.dateColumnName)
    repo = "ghidra://localhost/" + state.getProject().getName()
    path = GenSignatures.getPathFromDomainFile(currentProgram)
    gensig.openProgram(currentProgram, None, None, None, repo, path)
    fman = currentProgram.getFunctionManager()
    funcIter = fman.getFunctions(True)
    gensig.scanFunctions(funcIter, fman.getFunctionCount(), monitor)
    manager = gensig.getDescriptionManager()
    fwrite = FileWriter(outfile)
    try:
        manager.saveXml(fwrite)
    finally:
        fwrite.close()

run()

View File

@ -0,0 +1,443 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
//Queries all functions in the current selection (or all functions in the current program if
//the current selection is null) against all functions in a user-selected program.
//@category BSim
import java.util.*;
import org.apache.commons.collections4.IteratorUtils;
import generic.lsh.vector.*;
import ghidra.app.decompiler.DecompileException;
import ghidra.app.plugin.core.functioncompare.FunctionComparisonProvider;
import ghidra.app.script.GhidraScript;
import ghidra.app.services.FunctionComparisonService;
import ghidra.app.tablechooser.*;
import ghidra.features.bsim.query.*;
import ghidra.features.bsim.query.client.Configuration;
import ghidra.features.bsim.query.description.FunctionDescription;
import ghidra.program.model.address.Address;
import ghidra.program.model.listing.*;
//TODO: docs
public class LocalBSimQueryScript extends GhidraScript {
//functions with self significance below this bound will be skipped
private static final double SELF_SIGNIFICANCE_BOUND = 15.0;
//bsim database template determining the signature settings
private static final String TEMPLATE_NAME = "medium_nosize";
//these are analogous to the bounds in a bsim query
private static final double MATCH_SIMILARITY_LOWER_BOUND = 0.0;
private static final double MATCH_CONFIDENCE_LOWER_BOUND = 0.0;
private static final int MATCHES_PER_FUNCTION = 10;
//decrease this if you only want to see matches that aren't exact
//for instance, when looking for changes between two versions of a program
private static final double MATCH_SIMILARITY_UPPER_BOUND = 1.0;
private TableChooserDialog tableDialog;
@Override
protected void run() throws Exception {
if (isRunningHeadless()) {
popup("This script cannot be run headlessly.");
return;
}
Set<Function> sourceFuncs = new HashSet<>();
if (currentSelection == null) {
IteratorUtils.forEach(currentProgram.getFunctionManager().getFunctions(true),
x -> sourceFuncs.add(x));
}
else {
IteratorUtils.forEach(
currentProgram.getFunctionManager().getFunctionsOverlapping(currentSelection),
x -> sourceFuncs.add(x));
}
if (sourceFuncs.isEmpty()) {
this.popup("No non-stub functions to query!");
return;
}
Program targetProgram = askProgram("Select Target Program");
if (targetProgram == null) {
return;
}
try {
List<LocalBSimMatch> localMatches = null;
//use special optimized method when the target program is the same as the current program
//in that case, a given function might be in both the source and target sets
//but we only want to generate signatures for it once
if (currentProgram.getUniqueProgramID() == targetProgram.getUniqueProgramID()) {
localMatches = getMatchesCurrentProgram(sourceFuncs);
}
else {
//in this case there is no overlap between the source and target functions
localMatches = getMatchesTwoPrograms(sourceFuncs, currentProgram, targetProgram);
}
if (localMatches.isEmpty()) {
popup("No matches meeting criteria.");
return;
}
Collections.sort(localMatches);
initializeTable(currentProgram, targetProgram);
//again, use an optimized method for the special case when target program is the same
//as the current program
if (currentProgram.getUniqueProgramID() == targetProgram.getUniqueProgramID()) {
addMatchesOneProgram(localMatches, sourceFuncs);
}
else {
addMatchesTwoPrograms(localMatches);
}
}
finally {
targetProgram.release(this);
}
}
/**
* Iterate through the list of sorted matches, adding the top MATCHES_PER_FUNCTION elements
* to the table for each source function.
* @param localMatches matches in decreasing order of confidence
*/
private void addMatchesTwoPrograms(List<LocalBSimMatch> localMatches) {
Map<Function, Integer> matchCounts = new HashMap<>();
for (LocalBSimMatch match : localMatches) {
int count = matchCounts.getOrDefault(match.getSourceFunc(), 0);
if (count >= MATCHES_PER_FUNCTION) {
continue;
}
tableDialog.add(match);
matchCounts.put(match.getSourceFunc(), count + 1);
}
}
/**
* Iterate through the list of sorted matches, adding the top MATCHES_PER_FUNCTION elements
* to the table for each function in {@code sourceFuncSet}.
*
* By construction, the matches in this list have the "source" function before the "target"
* function (in address order). This is an optimization to prevent essentially the same
* data from appearing in the list twice (since the BSim similarity and confidence operations
* are commutative). So, for each match, we need to check whether the source or the
* target are in {@code sourceFuncSet}.
*
* @param localMatches matches in decreasing order of confidence
* @param sourceFuncSet source functions
*/
private void addMatchesOneProgram(List<LocalBSimMatch> localMatches,
Set<Function> sourceFuncSet) {
Map<Function, Integer> matchCounts = new HashMap<>();
for (LocalBSimMatch match : localMatches) {
Function leftFunc = match.getSourceFunc();
int leftCount = matchCounts.getOrDefault(leftFunc, 0);
if (sourceFuncSet.contains(leftFunc) && leftCount < MATCHES_PER_FUNCTION) {
tableDialog.add(match);
matchCounts.put(leftFunc, leftCount + 1);
}
Function rightFunc = match.getTargetFunc();
int rightCount = matchCounts.getOrDefault(rightFunc, 0);
if (sourceFuncSet.contains(rightFunc) && rightCount < MATCHES_PER_FUNCTION) {
LocalBSimMatch switched = new LocalBSimMatch(rightFunc, leftFunc,
match.getSimilarity(), match.getSignificance());
tableDialog.add(switched);
matchCounts.put(rightFunc, rightCount + 1);
}
}
}
private List<LocalBSimMatch> getMatchesCurrentProgram(Set<Function> funcs)
throws LSHException, DecompileException {
List<LocalBSimMatch> bsimMatches = new ArrayList<>();
LSHVectorFactory vectorFactory = getVectorFactory();
//generate the signatures for *all* functions in the program...
FunctionManager fman = currentProgram.getFunctionManager();
Iterator<Function> iter = fman.getFunctions(true);
GenSignatures gensig =
generateSignatures(currentProgram, iter, fman.getFunctionCount(), vectorFactory);
//...but use sourceFuncAddrs to ensure that source functions are in the
//funcs set
Set<Long> sourceFuncAddrs = new HashSet<>();
for (Function func : funcs) {
sourceFuncAddrs.add(func.getEntryPoint().getOffset());
}
Iterator<FunctionDescription> sourceDescripts =
gensig.getDescriptionManager().listAllFunctions();
VectorCompare vecCompare = new VectorCompare();
while (sourceDescripts.hasNext()) {
FunctionDescription srcDesc = sourceDescripts.next();
//skip if not in selection
if (!sourceFuncAddrs.contains(srcDesc.getAddress())) {
continue;
}
//skip if self-significance too small
LSHVector srcVector = srcDesc.getSignatureRecord().getLSHVector();
if (vectorFactory.getSelfSignificance(srcVector) <= SELF_SIGNIFICANCE_BOUND) {
continue;
}
Iterator<FunctionDescription> targetDescripts =
gensig.getDescriptionManager().listAllFunctions();
Function srcFunc = getFunction(currentProgram, srcDesc.getAddress());
while (targetDescripts.hasNext()) {
//skip if target before srcFunc in address order
//AND target is one of the source functions (i.e., in funcs)
FunctionDescription targetDesc = targetDescripts.next();
long targetAddress = targetDesc.getAddress();
if (sourceFuncAddrs.contains(targetAddress) &&
targetAddress <= srcDesc.getAddress()) {
continue;
}
//skip if self-significance too small
LSHVector targetVector = targetDesc.getSignatureRecord().getLSHVector();
if (vectorFactory.getSelfSignificance(targetVector) <= SELF_SIGNIFICANCE_BOUND) {
continue;
}
double sim = srcVector.compare(targetVector, vecCompare);
double sig = vectorFactory.calculateSignificance(vecCompare);
if (sig >= MATCH_CONFIDENCE_LOWER_BOUND && MATCH_SIMILARITY_LOWER_BOUND <= sim &&
sim <= MATCH_SIMILARITY_UPPER_BOUND) {
Function targetFunc = getFunction(currentProgram, targetDesc.getAddress());
bsimMatches.add(new LocalBSimMatch(srcFunc, targetFunc, sim, sig));
}
}
}
return bsimMatches;
}
private List<LocalBSimMatch> getMatchesTwoPrograms(Set<Function> srcFuncs,
Program sourceProgram, Program targetProgram) throws LSHException, DecompileException {
List<LocalBSimMatch> bsimMatches = new ArrayList<>();
LSHVectorFactory vectorFactory = getVectorFactory();
GenSignatures srcSigs =
generateSignatures(sourceProgram, srcFuncs.iterator(), srcFuncs.size(), vectorFactory);
FunctionManager targetFuncMan = targetProgram.getFunctionManager();
Iterator<Function> targetFuncIter = targetFuncMan.getFunctions(true);
GenSignatures targetSigs = generateSignatures(targetProgram, targetFuncIter,
targetFuncMan.getFunctionCount(), vectorFactory);
Iterator<FunctionDescription> sourceDescripts =
srcSigs.getDescriptionManager().listAllFunctions();
VectorCompare vecCompare = new VectorCompare();
while (sourceDescripts.hasNext()) {
FunctionDescription srcDesc = sourceDescripts.next();
//skip if self-significance too small
LSHVector srcVector = srcDesc.getSignatureRecord().getLSHVector();
if (vectorFactory.getSelfSignificance(srcVector) <= SELF_SIGNIFICANCE_BOUND) {
continue;
}
Iterator<FunctionDescription> targetDescripts =
targetSigs.getDescriptionManager().listAllFunctions();
Function srcFunc = getFunction(sourceProgram, srcDesc.getAddress());
while (targetDescripts.hasNext()) {
FunctionDescription targetDesc = targetDescripts.next();
//skip if self-significance too small
LSHVector targetVector = targetDesc.getSignatureRecord().getLSHVector();
if (vectorFactory.getSelfSignificance(targetVector) <= SELF_SIGNIFICANCE_BOUND) {
continue;
}
double sim = srcVector.compare(targetVector, vecCompare);
double sig = vectorFactory.calculateSignificance(vecCompare);
if (sig >= MATCH_CONFIDENCE_LOWER_BOUND && MATCH_SIMILARITY_LOWER_BOUND <= sim &&
sim <= MATCH_SIMILARITY_UPPER_BOUND) {
Function targetFunc = getFunction(targetProgram, targetDesc.getAddress());
bsimMatches.add(new LocalBSimMatch(srcFunc, targetFunc, sim, sig));
}
}
}
return bsimMatches;
}
private Function getFunction(Program program, long offset) {
Address addr = program.getAddressFactory().getDefaultAddressSpace().getAddress(offset);
return program.getFunctionManager().getFunctionAt(addr);
}
private LSHVectorFactory getVectorFactory() throws LSHException {
LSHVectorFactory vectorFactory = FunctionDatabase.generateLSHVectorFactory();
Configuration config = FunctionDatabase.loadConfigurationTemplate(TEMPLATE_NAME);
vectorFactory.set(config.weightfactory, config.idflookup, config.info.settings);
return vectorFactory;
}
private GenSignatures generateSignatures(Program program, Iterator<Function> funcs, int count,
LSHVectorFactory vectorFactory) throws LSHException, DecompileException {
GenSignatures gensig = new GenSignatures(false);
gensig.setVectorFactory(vectorFactory);
gensig.openProgram(program, null, null, null, null, null);
gensig.scanFunctions(funcs, count, monitor);
return gensig;
}
class LocalBSimMatch implements Comparable<LocalBSimMatch>, AddressableRowObject {
private Function sourceFunc;
private Function targetFunc;
private double similarity;
private double significance;
public LocalBSimMatch(Function sourceFunc, Function targetFunc, double sim, double signif) {
this.sourceFunc = sourceFunc;
this.targetFunc = targetFunc;
this.similarity = sim;
this.significance = signif;
}
public Function getSourceFunc() {
return sourceFunc;
}
public Function getTargetFunc() {
return targetFunc;
}
public double getSimilarity() {
return similarity;
}
public double getSignificance() {
return significance;
}
public Program getSourceProgram() {
return sourceFunc.getProgram();
}
public Program getTargetProgram() {
return targetFunc.getProgram();
}
@Override
public int compareTo(LocalBSimQueryScript.LocalBSimMatch o) {
return -Double.compare(significance, o.significance);
}
@Override
public Address getAddress() {
return sourceFunc.getEntryPoint();
}
}
/****************************************************************************************
* table stuff
****************************************************************************************/
class CompareMatchesExecutor implements TableChooserExecutor {
private FunctionComparisonService compareService;
private FunctionComparisonProvider comparisonProvider;
public CompareMatchesExecutor() {
compareService = state.getTool().getService(FunctionComparisonService.class);
}
@Override
public String getButtonName() {
return "Compare Selected Matches";
}
@Override
public boolean execute(AddressableRowObject rowObject) {
LocalBSimMatch match = (LocalBSimMatch) rowObject;
if (comparisonProvider == null) {
comparisonProvider =
compareService.compareFunctions(match.getSourceFunc(), match.getTargetFunc());
}
else {
compareService.compareFunctions(match.getSourceFunc(), match.getTargetFunc(),
comparisonProvider);
}
return false;
}
}
private void initializeTable(Program sourceProgram, Program targetProgram) {
StringBuilder titleBuilder = new StringBuilder("Local BSim Matches: ");
titleBuilder.append(sourceProgram.getDomainFile().getPathname());
titleBuilder.append(" -> ");
titleBuilder.append(targetProgram.getDomainFile().getPathname());
tableDialog =
createTableChooserDialog(titleBuilder.toString(), new CompareMatchesExecutor());
configureTableColumns(tableDialog);
tableDialog.setMinimumSize(800, 400);
tableDialog.show();
tableDialog.setMessage(null);
}
private void configureTableColumns(TableChooserDialog dialog) {
ColumnDisplay<Double> simColumn = new AbstractComparableColumnDisplay<Double>() {
@Override
public Double getColumnValue(AddressableRowObject rowObject) {
return ((LocalBSimMatch) rowObject).getSimilarity();
}
@Override
public String getColumnName() {
return "Similarity";
}
};
ColumnDisplay<Double> sigColumn = new AbstractComparableColumnDisplay<Double>() {
@Override
public Double getColumnValue(AddressableRowObject rowObject) {
return ((LocalBSimMatch) rowObject).getSignificance();
}
@Override
public String getColumnName() {
return "Significance";
}
};
StringColumnDisplay sourceFuncColumn = new StringColumnDisplay() {
@Override
public String getColumnValue(AddressableRowObject rowObject) {
return ((LocalBSimMatch) rowObject).getSourceFunc().getName(true);
}
@Override
public String getColumnName() {
return "Source Function";
}
};
StringColumnDisplay targetFuncColumn = new StringColumnDisplay() {
@Override
public String getColumnValue(AddressableRowObject rowObject) {
return ((LocalBSimMatch) rowObject).getTargetFunc().getName(true);
}
@Override
public String getColumnName() {
return "Target Function";
}
};
dialog.addCustomColumn(simColumn);
dialog.addCustomColumn(sigColumn);
dialog.addCustomColumn(sourceFuncColumn);
dialog.addCustomColumn(targetFuncColumn);
}
}
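
The per-function cap described in the Javadoc of addMatchesOneProgram can be tried out without Ghidra. The following is a minimal sketch, not the script's own code: `TopMatchesSketch`, `Pair`, and `topPerFunction` are hypothetical names, and functions are reduced to plain strings. It assumes, as the script does, that the input is already sorted by decreasing significance and that each unordered pair is stored only once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TopMatchesSketch {

    /** Hypothetical stand-in for LocalBSimMatch; functions are plain names here. */
    static final class Pair {
        final String source;
        final String target;
        final double significance;

        Pair(String source, String target, double significance) {
            this.source = source;
            this.target = target;
            this.significance = significance;
        }
    }

    /**
     * Keeps at most {@code cap} rows per selected function. Each stored pair is checked
     * in both directions, because a pair (a, b) also represents the match (b, a).
     */
    static List<Pair> topPerFunction(List<Pair> sorted, Set<String> selected, int cap) {
        Map<String, Integer> counts = new HashMap<>();
        List<Pair> rows = new ArrayList<>();
        for (Pair match : sorted) {
            int left = counts.getOrDefault(match.source, 0);
            if (selected.contains(match.source) && left < cap) {
                rows.add(match);
                counts.put(match.source, left + 1);
            }
            int right = counts.getOrDefault(match.target, 0);
            if (selected.contains(match.target) && right < cap) {
                // Swap the roles so the selected function shows up in the source column.
                rows.add(new Pair(match.target, match.source, match.significance));
                counts.put(match.target, right + 1);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Pair> sorted = List.of(
            new Pair("a", "b", 90.0),
            new Pair("a", "c", 80.0),
            new Pair("b", "c", 70.0));
        for (Pair row : topPerFunction(sorted, Set.of("a", "b"), 1)) {
            System.out.println(row.source + " -> " + row.target);
        }
    }
}
```

With a cap of 1 and both "a" and "b" selected, the single stored pair (a, b) yields one row for each of them, and the weaker matches are dropped.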


@ -0,0 +1,108 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Example of querying a BSim database about a single function
//@category BSim
import java.net.URL;
import java.util.Iterator;
import ghidra.app.script.GhidraScript;
import ghidra.features.bsim.query.*;
import ghidra.features.bsim.query.description.*;
import ghidra.features.bsim.query.protocol.*;
import ghidra.program.model.listing.Function;
public class QueryFunction extends GhidraScript {
//GenSignatures gensig;
//FunctionDatabase database;
private static final int MATCHES_PER_FUNC = 10;
private static final double SIMILARITY_BOUND = 0.7;
private static final double CONFIDENCE_BOUND = 0.0;
@Override
public void run() throws Exception {
if (currentProgram == null) {
return;
}
Function func = this.getFunctionContaining(this.currentAddress);
if (func == null){
popup("No function selected!");
return;
}
String DATABASE_URL = askString("Enter Database URL", "URL");
URL url = BSimClientFactory.deriveBSimURL(DATABASE_URL);
try (FunctionDatabase database = BSimClientFactory.buildClient(url, false)) {
if (!database.initialize()) {
println(database.getLastError().message);
return;
}
GenSignatures gensig = new GenSignatures(false);
try {
gensig.setVectorFactory(database.getLSHVectorFactory());
gensig.openProgram(currentProgram, null, null, null, null, null);
DescriptionManager manager = gensig.getDescriptionManager();
gensig.scanFunction(func);
QueryNearest query = new QueryNearest();
query.manage = manager;
query.max = MATCHES_PER_FUNC;
query.thresh = SIMILARITY_BOUND;
query.signifthresh = CONFIDENCE_BOUND;
ResponseNearest response = query.execute(database);
if (response == null) {
println(database.getLastError().message);
return;
}
Iterator<SimilarityResult> iter = response.result.iterator();
StringBuilder buf = new StringBuilder();
while (iter.hasNext()) {
SimilarityResult sim = iter.next();
FunctionDescription base = sim.getBase();
ExecutableRecord exe = base.getExecutableRecord();
buf.append("\nExecutable: ")
.append(exe.getNameExec())
.append("\nFunction: ")
.append(base.getFunctionName())
.append('\n');
Iterator<SimilarityNote> subiter = sim.iterator();
while (subiter.hasNext()) {
SimilarityNote note = subiter.next();
FunctionDescription fdesc = note.getFunctionDescription();
ExecutableRecord exerec = fdesc.getExecutableRecord();
buf.append(" Executable: ");
buf.append(exerec.getNameExec())
.append("\n Matching Function name: ")
.append(fdesc.getFunctionName());
buf.append("\n Similarity: ").append(note.getSimilarity());
buf.append("\n Significance: ").append(note.getSignificance());
buf.append("\n\n");
}
}
println(buf.toString());
}
finally {
gensig.dispose();
}
}
}
}
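
QueryFunction releases two resources: the signature generator in a `finally` block and the database through try-with-resources. The sketch below shows only the resulting cleanup order, using a hypothetical `Resource` class in place of the Ghidra types; it makes no claim about what `dispose()` or `close()` do internally.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanupOrderSketch {

    /** Hypothetical resource that records when it is closed. */
    static final class Resource implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        // Outer resource: closed automatically when the try block exits.
        try (Resource database = new Resource("database", log)) {
            Resource gensig = new Resource("gensig", log);
            try {
                log.add("query");
            }
            finally {
                // Inner resource: released first, even if the query throws.
                gensig.close();
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The inner resource is always released before the outer one, which matters when the inner object still depends on the outer one during cleanup.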


@ -0,0 +1,78 @@
## ###
# IP: GHIDRA
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##
# Example of performing a BSim query on a single function
# @category BSim.python
import ghidra.features.bsim.query.BSimClientFactory as BSimClientFactory
import ghidra.features.bsim.query.GenSignatures as GenSignatures
import ghidra.features.bsim.query.protocol.QueryNearest as QueryNearest

MATCHES_PER_FUNC = 100
SIMILARITY_BOUND = 0.7
CONFIDENCE_BOUND = 0.0

def query(func):
    DATABASE_URL = askString("Enter Database URL", "URL")
    url = BSimClientFactory.deriveBSimURL(DATABASE_URL)
    database = BSimClientFactory.buildClient(url, False)
    try:
        if not database.initialize():
            print database.getLastError().message
            return
        gensig = GenSignatures(False)
        try:
            # generate the signature of the selected function only
            gensig.setVectorFactory(database.getLSHVectorFactory())
            gensig.openProgram(currentProgram, None, None, None, None, None)
            gensig.scanFunction(func)

            # named "nearest" so that it does not shadow this function's own name
            nearest = QueryNearest()
            nearest.manage = gensig.getDescriptionManager()
            nearest.max = MATCHES_PER_FUNC
            nearest.thresh = SIMILARITY_BOUND
            nearest.signifthresh = CONFIDENCE_BOUND

            response = database.query(nearest)
            if response is None:
                print database.getLastError().message
                return

            simIter = response.result.iterator()
            while simIter.hasNext():
                sim = simIter.next()
                base = sim.getBase()
                exe = base.getExecutableRecord()
                print "Source executable: %s; source function: %s" % (exe.getNameExec(), base.getFunctionName())
                subIter = sim.iterator()
                while subIter.hasNext():
                    note = subIter.next()
                    fdesc = note.getFunctionDescription()
                    exerec = fdesc.getExecutableRecord()
                    print " Executable: %s" % exerec.getNameExec()
                    print " Matching Function name: %s " % fdesc.getFunctionName()
                    print " Similarity: %f" % note.getSimilarity()
                    print " Significance: %f\n" % note.getSignificance()
        finally:
            # release the signature generator even if the query fails
            gensig.dispose()
    finally:
        # close the connection on every path, including the early returns
        database.close()

if currentProgram is None:
    popup("currentProgram is None!")
else:
    func = currentProgram.getFunctionManager().getFunctionContaining(currentAddress)
    if func is None:
        popup("Cursor must be in a function!")
    else:
        query(func)


@ -0,0 +1,333 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
//Example of a script to perform a more involved BSim query.
//@category BSim
import java.util.*;
import java.util.function.BiPredicate;
import ghidra.app.script.GhidraScript;
import ghidra.features.bsim.gui.filters.*;
import ghidra.features.bsim.gui.search.results.BSimMatchResult;
import ghidra.features.bsim.gui.search.results.ExecutableResult;
import ghidra.features.bsim.query.FunctionDatabase;
import ghidra.features.bsim.query.FunctionDatabase.ErrorCategory;
import ghidra.features.bsim.query.description.FunctionDescription;
import ghidra.features.bsim.query.facade.*;
import ghidra.features.bsim.query.protocol.BSimFilter;
import ghidra.features.bsim.query.protocol.PreFilter;
import ghidra.program.database.symbol.FunctionSymbol;
import ghidra.program.model.address.Address;
import ghidra.program.model.listing.*;
import ghidra.program.model.symbol.SourceType;
import ghidra.util.exception.CancelledException;
/**
* Script showing how to apply filters to a BSim query. Currently we support three types
* of filters, described below:
*
* 1. QUERY THRESHOLDS
* These are the items at the top of the BSim query dialog:
* Similarity
* Confidence
* Matches per Function
* These are server-side filters that will be applied when the db is queried.
*
* 2. PREFILTERS
* Allows users to identify functions that meet certain criteria by specifying
* {@link BiPredicate}s. Any functions matching the predicate(s) will be included
* in the result set.
*
* 3. EXECUTABLE FILTERS
* These are predefined filters that can be applied on the server or on the
* client (applied only to the results of a query). On the BSim query
* dialog these are the items in the filter pulldown menu.
* @see BSimFilterType
*
* SCRIPT FLOW
* This example script does the following:
*
* 1) Set threshold filters
* 2) Set prefilters
* 3) Set executable filters
* 4) Query the database & print results
* 5) Set new executable filters
* 6) Print results
*
* NOTES: 1. You will be queried for the location of the BSim database. This URL
* will take the form "ghidra://<ip address>/<database name>".
*
* 2. This script is only an example - the specific filters demonstrated
* here will not necessarily apply to what's in your BSim database.
*
*/
public class QueryWithFiltersScript extends GhidraScript {
// Threshold settings.
private static final int MAX_NUM_FUNCTIONS = 100;
private static final double SIMILARITY_BOUND = 0.7;
private static final double SIGNIFICANCE_BOUND = 0.0;
// Restricts the number of results.
private static final int NUM_EXES_TO_DISPLAY = 10;
// Prefilter value we'll be setting.
private static final double SELF_SIGNIFICANCE_BOUND = 40.0;
private HashSet<FunctionSymbol> funcsToQuery;
private SimilarFunctionQueryService queryService;
private SFQueryInfo queryInfo;
private BSimFilter bsimFilter;
@Override
protected void run() throws Exception {
funcsToQuery = getFunctionsToQuery(currentProgram);
queryService = new SimilarFunctionQueryService(currentProgram);
queryInfo = new SFQueryInfo(funcsToQuery);
bsimFilter = queryInfo.getBsimFilter();
// Add threshold filters.
queryInfo.setMaximumResults(MAX_NUM_FUNCTIONS);
queryInfo.setSimilarityThreshold(SIMILARITY_BOUND);
queryInfo.setSignificanceThreshold(SIGNIFICANCE_BOUND);
// Add prefilters.
setPrefilters();
// Add a simple date filter.
addBsimFilter(new DateLaterBSimFilterType(""), "01/01/1776");
// Demonstration of a filter that allows for multiple entries. All filters but the
// DateEarlier and DateLater allow this. The effect is that each filter will be OR'd
// with the others. This is effectively the same as creating three distinct ArchEquals filters.
//
// ie: "The architecture can equal x86:LE:64:default OR the architecture can equal
// ARM:LE:32:v4 OR ...."
addBsimFilter(new ArchitectureBSimFilterType(),
"x86:LE:64:default, x86:LE:32:default, ARM:LE:32:v4");
// Another filter with multiple entries, but in this case, since it is a "NotEqual" filter,
// the items are AND'd together.
//
// ie: "The compiler cannot equal windows AND the compiler cannot equal foo_compiler".
addBsimFilter(new NotCompilerBSimFilterType(), "windows, foo_compiler");
//connect to the database
try {
String dbUrl =
askString("", "Enter the URL of the BSim database:", "ghidra://localhost/bsimDb");
queryService.initializeDatabase(dbUrl);
FunctionDatabase.Error error = queryService.getLastError();
if (error != null && error.category == ErrorCategory.Nodatabase) {
println("Database [" + dbUrl + "] cannot be found (does it exist?)");
return;
}
}
catch (QueryDatabaseException e) {
println(e.getMessage());
return;
}
// Execute query and print results.
List<BSimMatchResult> resultRows = executeQuery(queryInfo);
printFunctionQueryResults(resultRows, "\nFunction-level results before filtering");
// Add some simple post-query filters. These filters will only be applied to the result
// set returned from the previous query.
addBsimFilter(new Md5BSimFilterType(), currentProgram.getExecutableMD5());
addBsimFilter(new CompilerBSimFilterType(), "gcc");
addBsimFilter(new FunctionTagBSimFilterType("KNOWN_LIBRARY", queryService),
"false");
// Apply the filters and print results.
List<BSimMatchResult> filteredRows =
BSimMatchResult.filterMatchRows(bsimFilter, resultRows);
printFunctionQueryResults(filteredRows, "\nFunction-level results after filtering");
printExecutableInformation(filteredRows);
}
@Override
public void cleanup(boolean success) {
if (queryService != null) {
queryService.dispose();
}
}
/***********************************************************************
* PRIVATE METHODS
***********************************************************************/
/**
* Adds one filter atom to the script's {@link BSimFilter} for each comma-separated value.
*
* @param filterTemplate the filter type to add
* @param value one or more filter values, separated by commas
*/
private void addBsimFilter(BSimFilterType filterTemplate, String value) {
String[] inputs = value.split(",");
for (String input : inputs) {
if (!input.trim().isEmpty()) {
bsimFilter.addAtom(filterTemplate, input.trim());
}
}
}
/**
* Queries the database and returns the results.
*
* @param qInfo contains all information required for the query
* @return list of matches
* @throws QueryDatabaseException if there is a problem executing the similar-functions query
* @throws CancelledException if the user cancelled the operation
*/
private List<BSimMatchResult> executeQuery(SFQueryInfo qInfo)
throws QueryDatabaseException, CancelledException {
SFQueryResult queryResults = queryService.querySimilarFunctions(qInfo, null, monitor);
List<BSimMatchResult> resultRows =
BSimMatchResult.generate(queryResults.getSimilarityResults(), currentProgram);
return resultRows;
}
/**
* Creates predicates that will be used to filter out functions. This example provides three
* different methods of doing this:
*
* - anonymous class
* - lambda
* - static method
*
* These are all possible because the filter takes a {@link BiPredicate}, which is a
* functional interface.
*
*/
private void setPrefilters() {
PreFilter preFilter = queryInfo.getPreFilter();
//
// Option 1: Anonymous class
// Filters out any functions with a self significance less than a
// certain value.
//
preFilter.addPredicate(new BiPredicate<Program, FunctionDescription>() {
@Override
public boolean test(Program t, FunctionDescription u) {
return queryService.getLSHVectorFactory()
.getSelfSignificance(
u.getSignatureRecord().getLSHVector()) >= SELF_SIGNIFICANCE_BOUND;
}
});
//
// Option 2. Lambda expression
// Filters out any functions with a self significance less than a
// certain value.
//
preFilter.addPredicate((x, y) -> queryService.getLSHVectorFactory()
.getSelfSignificance(
y.getSignatureRecord().getLSHVector()) >= SELF_SIGNIFICANCE_BOUND);
//
// Option 3. Static method
// Filters out any functions that are of type ANALYSIS.
//
preFilter.addPredicate(QueryWithFiltersScript::isNotAnalysisSourceType);
}
/**
* Returns a set of ALL functions (no stubs) in the given program.
*
* @param program the program to get the functions from
* @return set of function symbols
*/
private HashSet<FunctionSymbol> getFunctionsToQuery(Program program) {
HashSet<FunctionSymbol> functions = new HashSet<>();
FunctionIterator fIter = program.getFunctionManager().getFunctionsNoStubs(true);
for (Function func : fIter) {
functions.add((FunctionSymbol) func.getSymbol());
}
return functions;
}
/**
* Returns true if the given function is NOT an analysis type.
*
* @param program the current program
* @param funcDesc the function description object
* @return true if the symbol is NOT an analysis source type
*/
public static boolean isNotAnalysisSourceType(Program program, FunctionDescription funcDesc) {
Address address =
program.getAddressFactory().getDefaultAddressSpace().getAddress(funcDesc.getAddress());
Function function = program.getFunctionManager().getFunctionAt(address);
// Reject the description if it does not correspond to a function in this program.
if (function == null || !function.getName().equals(funcDesc.getFunctionName())) {
return false;
}
return function.getSymbol().getSource() != SourceType.ANALYSIS;
}
/**
* Prints a sorted list of executables represented in the function matches.
*
* @param filteredRows list of function results
*/
private void printExecutableInformation(List<BSimMatchResult> filteredRows) {
TreeSet<ExecutableResult> execrows = ExecutableResult.generateFromMatchRows(filteredRows);
ExecutableResult[] results = new ExecutableResult[execrows.size()];
results = execrows.toArray(results);
Arrays.sort(results, new Comparator<ExecutableResult>() {
@Override
public int compare(ExecutableResult o1, ExecutableResult o2) {
return Double.compare(o2.getSignificanceSum(), o1.getSignificanceSum());
}
});
printf("Executable-level results:\n");
for (int i = 0, max = Math.min(NUM_EXES_TO_DISPLAY, results.length); i < max; ++i) {
printf(" MD5: %s\n", results[i].getExecutableRecord().getMd5());
printf(" Executable Name: %s\n", results[i].getExecutableRecord().getNameExec());
printf(" Function Count: %d\n", results[i].getFunctionCount());
printf(" Significance Sum: %f\n\n", results[i].getSignificanceSum());
}
}
/**
* Prints information about each function in the result set.
*
* @param resultRows the list of rows containing the info to print
* @param title the title to print
*/
private void printFunctionQueryResults(List<BSimMatchResult> resultRows, String title) {
printf("%s: (%d)\n\n", title, resultRows.size());
for (BSimMatchResult resultRow : resultRows) {
printf(" queried function: %s\n",
resultRow.getOriginalFunctionDescription().getFunctionName());
printf(" matching function: %s\n",
resultRow.getMatchFunctionDescription().getFunctionName());
printf(" executable of matching function: %s\n",
resultRow.getMatchFunctionDescription().getExecutableRecord().getNameExec());
printf(" similarity: %f\n", resultRow.getSimilarity());
printf(" significance: %f\n\n", resultRow.getSignificance());
}
printf("\n");
}
}
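
The three ways of supplying a prefilter shown in setPrefilters (anonymous class, lambda, static method reference) work for any `BiPredicate`. Here is a minimal Ghidra-free sketch: `PreFilterSketch` and its methods are hypothetical, a function is reduced to a name and a self-significance value, and the sketch assumes that a candidate is kept only when every registered predicate accepts it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

final class PreFilterSketch {

    private final List<BiPredicate<String, Double>> predicates = new ArrayList<>();

    void addPredicate(BiPredicate<String, Double> predicate) {
        predicates.add(predicate);
    }

    /** Assumption of this sketch: every predicate must accept the candidate. */
    boolean accepts(String name, double selfSignificance) {
        for (BiPredicate<String, Double> predicate : predicates) {
            if (!predicate.test(name, selfSignificance)) {
                return false;
            }
        }
        return true;
    }

    /** Static method with a BiPredicate-compatible signature. */
    static boolean isNotThunk(String name, Double selfSignificance) {
        return !name.startsWith("thunk_");
    }

    static PreFilterSketch build(final double bound) {
        PreFilterSketch filter = new PreFilterSketch();
        // Option 1: anonymous class
        filter.addPredicate(new BiPredicate<String, Double>() {
            @Override
            public boolean test(String name, Double selfSignificance) {
                return selfSignificance >= bound;
            }
        });
        // Option 2: lambda expression
        filter.addPredicate((name, selfSignificance) -> !name.isEmpty());
        // Option 3: static method reference
        filter.addPredicate(PreFilterSketch::isNotThunk);
        return filter;
    }

    public static void main(String[] args) {
        PreFilterSketch filter = build(40.0);
        System.out.println(filter.accepts("parse_header", 55.0));
        System.out.println(filter.accepts("tiny_stub", 12.0));
        System.out.println(filter.accepts("thunk_memcpy", 80.0));
    }
}
```

All three forms compile to the same functional interface, so the choice between them is a matter of readability and reuse.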


@ -0,0 +1,173 @@
## ###
# IP: GHIDRA
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##
# Advanced example of BSim querying
# @category BSim.python
import ghidra.features.bsim.query.facade.SimilarFunctionQueryService as SimilarFunctionQueryService
import ghidra.features.bsim.query.facade.SFQueryInfo as SFQueryInfo
import ghidra.features.bsim.query.facade.QueryDatabaseException as QueryDatabaseException
import ghidra.features.bsim.query.FunctionDatabase as FunctionDatabase
import ghidra.features.bsim.gui.search.results.BSimMatchResult as BSimMatchResult
import ghidra.features.bsim.gui.search.results.ExecutableResult as ExecutableResult
import ghidra.features.bsim.gui.filters.ArchitectureBSimFilterType as ArchitectureBSimFilterType
import ghidra.features.bsim.gui.filters.CompilerBSimFilterType as CompilerBSimFilterType
import ghidra.features.bsim.gui.filters.DateLaterBSimFilterType as DateLaterBSimFilterType
import ghidra.features.bsim.gui.filters.FunctionTagBSimFilterType as FunctionTagBSimFilterType
#assumption: the two "Not" filter types below are the replacements for the
#former CompNotEqual and Md5NotEqual templates
import ghidra.features.bsim.gui.filters.NotCompilerBSimFilterType as NotCompilerBSimFilterType
import ghidra.features.bsim.gui.filters.NotMd5BSimFilterType as NotMd5BSimFilterType
import java.util.HashSet as HashSet
import java.util.function.BiPredicate as BiPredicate
import java.util.Comparator as Comparator
import java.util.Arrays as Arrays
import java.lang.Double as Double

#Query thresholds
MAX_NUM_FUNCTIONS = 100
SIMILARITY_BOUND = 0.7
SIGNIFICANCE_BOUND = 0.0

#limit the number of results displayed
NUM_EXES_TO_DISPLAY = 10

#for prefiltering: this number will be used to filter out small functions
SELF_SIGNIFICANCE_BOUND = 40.0

def run():
    #get the set of functions to query
    funcsToQuery = getFunctionsToQuery()

    #set up the objects required for querying the database
    queryService = SimilarFunctionQueryService(currentProgram)
    try:
        queryInfo = SFQueryInfo(funcsToQuery)
        bsimFilter = queryInfo.getBsimFilter()

        #set the query parameters.
        #change the defined constants to control how fuzzy a match
        #you are willing to accept and the maximum number of matches
        #to return for each function
        queryInfo.setMaximumResults(MAX_NUM_FUNCTIONS)
        queryInfo.setSimilarityThreshold(SIMILARITY_BOUND)
        queryInfo.setSignificanceThreshold(SIGNIFICANCE_BOUND)

        #add the prefilters
        setPrefilters(queryService, queryInfo)

        #add a filter on the date
        addBsimFilter(bsimFilter, DateLaterBSimFilterType(""), "01/01/1776")

        #add a filter with multiple values. Since this is an "Equal" filter, the results are OR'd together,
        #so a given executable will pass the main filter if it passes at least one of the subfilters
        addBsimFilter(bsimFilter, ArchitectureBSimFilterType(), "x86:LE:64:default, x86:LE:32:default, ARM:LE:32:v4")

        #now add a "notequal" filter
        #to pass, the compiler can't be windows and it can't be foo_compiler
        addBsimFilter(bsimFilter, NotCompilerBSimFilterType(), "windows, foo_compiler")

        #establish a connection to the BSim database
        try:
            dbUrl = askString("", "Enter the URL of the BSim database:", "ghidra://localhost/bsimDB")
            queryService.initializeDatabase(dbUrl)
            error = queryService.getLastError()
            if error is not None and error.category == FunctionDatabase.ErrorCategory.Nodatabase:
                print "Database [%s] cannot be found (does it exist?)" % dbUrl
                return
        except QueryDatabaseException as e:
            print e.getMessage()
            return

        resultRows = executeQuery(queryService, queryInfo)
        printFunctionQueryResults(resultRows, "\nFunction-level results before filtering")

        #now add some post-query filters, which filter the result set returned by the previous query
        addBsimFilter(bsimFilter, NotMd5BSimFilterType(), currentProgram.getExecutableMD5())
        addBsimFilter(bsimFilter, CompilerBSimFilterType(), "gcc")
        addBsimFilter(bsimFilter, FunctionTagBSimFilterType("KNOWN_LIBRARY", queryService), "false")

        #apply the filters and print the results
        filteredRows = BSimMatchResult.filterMatchRows(bsimFilter, resultRows)
        printFunctionQueryResults(filteredRows, "\nFunction-level results after filtering")
        printExecutableInformation(filteredRows)
    finally:
        #release the query service on every path, as the Java version does in cleanup()
        queryService.dispose()

#collect the functions to query from currentProgram
def getFunctionsToQuery():
    functions = HashSet()
    fIter = currentProgram.getFunctionManager().getFunctionsNoStubs(True)
    for func in fIter:
        functions.add(func.getSymbol())
    return functions

#query the database
def executeQuery(queryService, queryInfo):
    queryResults = queryService.querySimilarFunctions(queryInfo, None, monitor)
    resultRows = BSimMatchResult.generate(queryResults.getSimilarityResults(), currentProgram)
    return resultRows

def printFunctionQueryResults(resultRows, title):
    print "%s: %d\n\n" % (title, resultRows.size())
    for row in resultRows:
        print " queried function: %s" % row.getOriginalFunctionDescription().getFunctionName()
        print " matching function: %s" % row.getMatchFunctionDescription().getFunctionName()
        print " executable of matching function: %s" % row.getMatchFunctionDescription().getExecutableRecord().getNameExec()
        print " similarity: %f" % row.getSimilarity()
        print " significance: %f\n" % row.getSignificance()

#Prefilters are used to filter out functions before sending a query to the database.
#A typical use case would be to collect all functions in a binary, then use a
#prefilter to remove the functions with low self-significance (which is the
#"BSim way" to remove small functions)
def setPrefilters(queryService, queryInfo):
    preFilter = queryInfo.getPreFilter()
    selfSigFilter = ExampleFilter(queryService)
    preFilter.addPredicate(selfSigFilter)

class ExampleFilter(BiPredicate):
    def __init__(self, queryService):
        self.queryService = queryService

    def test(self, program, fdesc):
        return self.queryService.getLSHVectorFactory().getSelfSignificance(fdesc.getSignatureRecord().getLSHVector()) >= SELF_SIGNIFICANCE_BOUND

#add one filter atom for each comma-separated value
def addBsimFilter(bsimFilter, filterTemplate, values):
    for value in values.split(","):
        if len(value.strip()) > 0:
            bsimFilter.addAtom(filterTemplate, value.strip())

#calls the methods to aggregate executable-level information about the matches
def printExecutableInformation(filteredRows):
    execrows = ExecutableResult.generateFromMatchRows(filteredRows)
    results = execrows.toArray()
    sorter = Sorter()
    Arrays.sort(results, sorter)
    print "Executable-level results:"
    numExes = min(len(results), NUM_EXES_TO_DISPLAY)
    for i in range(numExes):
        print " MD5: %s" % results[i].getExecutableRecord().getMd5()
        print " Executable Name: %s" % results[i].getExecutableRecord().getNameExec()
        print " Function Count: %d" % results[i].getFunctionCount()
        print " Significance Sum: %f\n" % results[i].getSignificanceSum()

#sorts executables by decreasing significance sum
class Sorter(Comparator):
    def compare(self, o1, o2):
        return Double.compare(o2.getSignificanceSum(), o1.getSignificanceSum())

run()

@ -0,0 +1,45 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import org.apache.commons.lang3.StringUtils;
import ghidra.app.script.GhidraScript;
import ghidra.framework.options.Options;
import ghidra.program.model.listing.Program;
//@category BSim
//sets a property on the current program which can be used as
//an executable category in BSim
public class SetExecutableCategoryScript extends GhidraScript {
@Override
protected void run() throws Exception {
if (currentProgram == null) {
popup("This script requires a program");
return;
}
Options opts = currentProgram.getOptions(Program.PROGRAM_INFO);
String name = askString("Enter Property Name", "Name");
if (StringUtils.isAllBlank(name)) {
return;
}
String value = askString("Enter Value of Property " + name, "Value");
if (StringUtils.isAllBlank(value)) {
return;
}
opts.setString(name, value);
}
}

@ -0,0 +1,56 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import ghidra.app.script.GhidraScript;
import ghidra.framework.options.Options;
import ghidra.program.model.listing.Program;
// Set up tailored auto-analysis (in place of the headless analyzer's full auto-analysis)
// suitable for the BSim ingest process. Intended to be invoked as an analyzeHeadless -preScript
//@category BSim
public class TailoredAnalysis extends GhidraScript {
@Override
public void run() throws Exception {
Options pl = currentProgram.getOptions(Program.ANALYSIS_PROPERTIES);
pl.setBoolean("Decompiler Parameter ID", false);
// These analyzers generate lots of cross references, which are not necessary for
// signature analysis, and take time to run. On the other hand, you may want
// them in general to facilitate general analysis
pl.setBoolean("Stack", false);
// pl.setBoolean("Windows x86 PE Instruction References", false);
// pl.setBoolean("Windows x86 PE C++", false);
// pl.setBoolean("Windows x86 PE Preliminary", false);
// pl.setBoolean("ELF Scalar Operand References", false);
// Mangled symbols are good information but you may not be able to count on them being present in all versions
// Options analyzerOptions = pl.getOptions("Demangler");
// analyzerOptions.setBoolean("Commit Function Signatures", false);
// You really want these options turned on
pl.setBoolean("Shared Return Calls",true);
pl.setBoolean("Function Start Search", true);
pl.setBoolean("DWARF", false);
// Options analyzerOptions = pl.getOptions("Function Start Search");
// analyzerOptions.setBoolean("Search Data Blocks", true);
// analyzerOptions = pl.getOptions("Function Start Search After Code");
// analyzerOptions.setBoolean("Search Data Blocks", true);
// analyzerOptions = pl.getOptions("Function Start Search After Data");
// analyzerOptions.setBoolean("Search Data Blocks", true);
}
}

@ -0,0 +1,103 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Push updated information about function names and other metadata from the current program to a BSim database
//@category BSim
import java.net.URL;
import ghidra.app.script.GhidraScript;
import ghidra.features.bsim.query.*;
import ghidra.features.bsim.query.description.ExecutableRecord;
import ghidra.features.bsim.query.description.FunctionDescription;
import ghidra.features.bsim.query.protocol.QueryUpdate;
import ghidra.features.bsim.query.protocol.ResponseUpdate;
import ghidra.program.model.listing.FunctionIterator;
import ghidra.program.model.listing.FunctionManager;
public class UpdateBSimMetadata extends GhidraScript {
@Override
protected void run() throws Exception {
if (currentProgram == null) {
return;
}
String bsim_url = System.getProperty("ghidra.bsimurl");
if (bsim_url==null || bsim_url.length()==0) {
bsim_url = askString("Request Repository", "Select URL of database receiving update");
}
URL url = BSimClientFactory.deriveBSimURL(bsim_url);
try (FunctionDatabase database = BSimClientFactory.buildClient(url, true)) {
if (!database.initialize()) {
println(database.getLastError().message);
return;
}
println("Connected to " + database.getInfo().databasename);
GenSignatures gensig = new GenSignatures(false);
gensig.setVectorFactory(database.getLSHVectorFactory());
gensig.openProgram(currentProgram, null, null, null, null, null);
FunctionManager functionManager = currentProgram.getFunctionManager();
FunctionIterator funciter;
if (currentSelection != null) {
println("Scanning selected functions");
funciter = functionManager.getFunctions(currentSelection, true);
}
else {
println("Scanning all functions");
funciter = functionManager.getFunctions(true); // If no selection, update all functions
}
gensig.scanFunctionsMetadata(funciter, monitor);
QueryUpdate update = new QueryUpdate();
update.manage = gensig.getDescriptionManager();
ResponseUpdate respup = update.execute(database); // Try to push the update
if (respup == null) {
println(database.getLastError().message);
return;
}
if (!respup.badexe.isEmpty()) {
for (int j = 0; j < respup.badexe.size(); ++j) {
ExecutableRecord erec = respup.badexe.get(j);
println("Database does not contain executable: " + erec.getNameExec());
}
}
if (!respup.badfunc.isEmpty()) {
int max = respup.badfunc.size();
if (max > 10) {
println(
"Could not find " + Integer.toString(respup.badfunc.size()) + " functions");
max = 10;
}
for (int j = 0; j < max; ++j) {
FunctionDescription func = respup.badfunc.get(j);
println("Could not update function " + func.getFunctionName());
}
}
if (respup.exeupdate > 0) {
println("Updated executable metadata");
}
if (respup.funcupdate > 0) {
println("Updated " + Integer.toString(respup.funcupdate) + " functions");
}
if (respup.exeupdate == 0 && respup.funcupdate == 0) {
println("No changes");
}
}
}
}

@ -0,0 +1,126 @@
#!/bin/bash
## ###
# IP: GHIDRA
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
##
#
# This script may be used to build the postgresql server within
# a GHIDRA installation. The postgresql server configuration options
# below (POSTGRES_CONFIG_OPTIONS) may be adjusted if required
# (e.g., build without openssl use, etc.).
#
# See https://www.postgresql.org/docs/10/install-procedure.html
# for supported postgresql config options.
#
# Additional packages may need to be installed in order to perform the
# postgresql build. Please refer to the following web page for
# package dependencies:
#
# https://wiki.postgresql.org/wiki/Compile_and_Install_from_source_code
#
# The postgresql source distribution should reside within the BSim module
# directory prior to running this script. Within development environments
# it will first check the ghidra.bin repo for this source file.
#
POSTGRES=postgresql-15.3
POSTGRES_GZ=${POSTGRES}.tar.gz
POSTGRES_CONFIG_OPTIONS="--disable-rpath --with-openssl"
DIR=$(cd `dirname $0`; pwd)
POSTGRES_GZ_PATH=${DIR}/../../../../ghidra.bin/Ghidra/Features/BSim/${POSTGRES_GZ}
if [ ! -f "${POSTGRES_GZ_PATH}" ]; then
POSTGRES_GZ_PATH=${DIR}/${POSTGRES_GZ}
if [ ! -f "${POSTGRES_GZ_PATH}" ]; then
echo "Postgres source bundle not found: ${POSTGRES_GZ_PATH}"
exit -1
fi
fi
OS=`uname -s`
ARCH=`arch`
cd ${DIR}
mkdir -p build > /dev/null
if [ ! -d build/${POSTGRES} ]; then
# Unpack postgres source distro into build
echo "Unpacking postgresql source: ${POSTGRES_GZ_PATH}"
( cd build && tar -xzf "${POSTGRES_GZ_PATH}" )
fi
# Build postgresql
pushd build/${POSTGRES}
if [ "$OS" = "Darwin" ]; then
export MACOSX_DEPLOYMENT_TARGET=10.5
export ARCHFLAGS="-arch x86_64"
OSDIR=mac_x86_64
elif [ "$ARCH" = "x86_64" ]; then
OSDIR=linux_x86_64
else
echo "Unsupported platform: $OS $ARCH"
exit -1
fi
# Install within build/os
INSTALL_DIR=${DIR}/build/os/${OSDIR}/postgresql
rm -rf ${INSTALL_DIR} > /dev/null
make distclean
# Configure postgres
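# Each step exits with the failing command's own status. (Testing $? in a separate
# "if [ $? != 0 ]" and then running "exit $?" would exit with the status of the test, i.e. 0.)
./configure ${POSTGRES_CONFIG_OPTIONS} --prefix=${INSTALL_DIR} || exit $?
make install || exit $?
make -C contrib/pg_prewarm install || exit $?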
echo "Completed postgresql build"
# Build lshvector plugin for postgresql
popd
rm -rf build/lshvector > /dev/null
mkdir build/lshvector
echo "Building lshvector plugin..."
cp src/lshvector/* build/lshvector
cp src/lshvector/c/* build/lshvector
cd build/lshvector
make -f Makefile.lshvector install PG_CONFIG=${INSTALL_DIR}/bin/pg_config
if [ $? = 0 ]; then
echo "Completed build and install of lshvector postgresql plugin"
exit 0
fi
exit -1

@ -0,0 +1,34 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import ghidra.app.script.GhidraScript;
import ghidra.framework.options.Options;
import ghidra.program.model.listing.Program;
/**
* This script is used by the unit test BSimServerTest
*/
public class InstallMetadataTest extends GhidraScript {
@Override
protected void run() throws Exception {
Options pl = currentProgram.getOptions(Program.PROGRAM_INFO);
String value = "static";
if (currentProgram.getName().contains(".so"))
value = "shared";
pl.setString("Test Category", value);
}
}

@ -0,0 +1,69 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import generic.lsh.vector.LSHVectorFactory;
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.FunctionManager;
import ghidra.features.bsim.query.FunctionDatabase;
import ghidra.features.bsim.query.GenSignatures;
import ghidra.features.bsim.query.client.Configuration;
import ghidra.features.bsim.query.description.DescriptionManager;
/**
* This script is used by the unit test BSimServerTest
*/
public class RegressionSignatures extends GhidraScript {
@Override
protected void run() throws Exception {
String md5string = currentProgram.getExecutableMD5();
if ((md5string == null) || (md5string.length() < 10))
throw new IOException("Could not get MD5 on file: " + currentProgram.getName());
String basename = "sigs_" + md5string;
File file = null;
// This form of askDirectory works for both standalone and parallel execution
File workingdir = askDirectory("RegressionSignatures:", "Working directory");
file = new File(workingdir, basename);
LSHVectorFactory vectorFactory = FunctionDatabase.generateLSHVectorFactory();
Configuration config = FunctionDatabase.loadConfigurationTemplate("medium_64");
vectorFactory.set(config.weightfactory, config.idflookup, config.info.settings);
GenSignatures gensig = new GenSignatures(true);
gensig.setVectorFactory(vectorFactory);
List<String> names = new ArrayList<String>();
names.add("Test Category");
gensig.addExecutableCategories(names);
String repo = "ghidra://localhost/repo";
String path = "/raw";
gensig.openProgram(this.currentProgram, null, null, null, repo, path);
FunctionManager fman = currentProgram.getFunctionManager();
Iterator<Function> iter = fman.getFunctions(true);
gensig.scanFunctions(iter, fman.getFunctionCount(), monitor);
DescriptionManager manager = gensig.getDescriptionManager();
// try-with-resources closes the writer even if saveXml throws
try (FileWriter fwrite = new FileWriter(file)) {
manager.saveXml(fwrite);
}
}
}

@ -0,0 +1,25 @@
# Locality Sensitive Hashing package
# NOTE: This file cannot be executed in place. It is copied into a temporary
# directory with its source code and executed there.
ifeq ($(PG_CONFIG),)
default:
	echo "You must specify PG_CONFIG"
	false
endif
MODULE_big = lshvector
OBJS= lsh.o weights.o binhash.o crc32.o
EXTENSION = lshvector
DATA = lshvector--1.0.sql
REGRESS = lshvector
EXTRA_CLEAN =
SHLIB_LINK += $(filter -lm, $(LIBS))
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

@ -0,0 +1,277 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "lsh.h"
#define LSH_HASHBASE 0xD7E6A299
static char hash_signtable[512];
static void hash_int_fft_16(int32 *arr)
{
int32 x,y;
x = arr[0]; y = arr[8]; arr[0] = x + y; arr[8] = x - y;
x = arr[1]; y = arr[9]; arr[1] = x + y; arr[9] = x - y;
x = arr[2]; y = arr[10]; arr[2] = x + y; arr[10] = x - y;
x = arr[3]; y = arr[11]; arr[3] = x + y; arr[11] = x - y;
x = arr[4]; y = arr[12]; arr[4] = x + y; arr[12] = x - y;
x = arr[5]; y = arr[13]; arr[5] = x + y; arr[13] = x - y;
x = arr[6]; y = arr[14]; arr[6] = x + y; arr[14] = x - y;
x = arr[7]; y = arr[15]; arr[7] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[4]; arr[0] = x + y; arr[4] = x - y;
x = arr[1]; y = arr[5]; arr[1] = x + y; arr[5] = x - y;
x = arr[2]; y = arr[6]; arr[2] = x + y; arr[6] = x - y;
x = arr[3]; y = arr[7]; arr[3] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[12]; arr[8] = x + y; arr[12] = x - y;
x = arr[9]; y = arr[13]; arr[9] = x + y; arr[13] = x - y;
x = arr[10]; y = arr[14]; arr[10] = x + y; arr[14] = x - y;
x = arr[11]; y = arr[15]; arr[11] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[2]; arr[0] = x + y; arr[2] = x - y;
x = arr[1]; y = arr[3]; arr[1] = x + y; arr[3] = x - y;
x = arr[4]; y = arr[6]; arr[4] = x + y; arr[6] = x - y;
x = arr[5]; y = arr[7]; arr[5] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[10]; arr[8] = x + y; arr[10] = x - y;
x = arr[9]; y = arr[11]; arr[9] = x + y; arr[11] = x - y;
x = arr[12]; y = arr[14]; arr[12] = x + y; arr[14] = x - y;
x = arr[13]; y = arr[15]; arr[13] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[1]; arr[0] = x + y; arr[1] = x - y;
x = arr[2]; y = arr[3]; arr[2] = x + y; arr[3] = x - y;
x = arr[4]; y = arr[5]; arr[4] = x + y; arr[5] = x - y;
x = arr[6]; y = arr[7]; arr[6] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[9]; arr[8] = x + y; arr[9] = x - y;
x = arr[10]; y = arr[11]; arr[10] = x + y; arr[11] = x - y;
x = arr[12]; y = arr[13]; arr[12] = x + y; arr[13] = x - y;
x = arr[14]; y = arr[15]; arr[14] = x + y; arr[15] = x - y;
}
static void hash_double_fft_16(double *arr)
{
double x,y;
x = arr[0]; y = arr[8]; arr[0] = x + y; arr[8] = x - y;
x = arr[1]; y = arr[9]; arr[1] = x + y; arr[9] = x - y;
x = arr[2]; y = arr[10]; arr[2] = x + y; arr[10] = x - y;
x = arr[3]; y = arr[11]; arr[3] = x + y; arr[11] = x - y;
x = arr[4]; y = arr[12]; arr[4] = x + y; arr[12] = x - y;
x = arr[5]; y = arr[13]; arr[5] = x + y; arr[13] = x - y;
x = arr[6]; y = arr[14]; arr[6] = x + y; arr[14] = x - y;
x = arr[7]; y = arr[15]; arr[7] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[4]; arr[0] = x + y; arr[4] = x - y;
x = arr[1]; y = arr[5]; arr[1] = x + y; arr[5] = x - y;
x = arr[2]; y = arr[6]; arr[2] = x + y; arr[6] = x - y;
x = arr[3]; y = arr[7]; arr[3] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[12]; arr[8] = x + y; arr[12] = x - y;
x = arr[9]; y = arr[13]; arr[9] = x + y; arr[13] = x - y;
x = arr[10]; y = arr[14]; arr[10] = x + y; arr[14] = x - y;
x = arr[11]; y = arr[15]; arr[11] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[2]; arr[0] = x + y; arr[2] = x - y;
x = arr[1]; y = arr[3]; arr[1] = x + y; arr[3] = x - y;
x = arr[4]; y = arr[6]; arr[4] = x + y; arr[6] = x - y;
x = arr[5]; y = arr[7]; arr[5] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[10]; arr[8] = x + y; arr[10] = x - y;
x = arr[9]; y = arr[11]; arr[9] = x + y; arr[11] = x - y;
x = arr[12]; y = arr[14]; arr[12] = x + y; arr[14] = x - y;
x = arr[13]; y = arr[15]; arr[13] = x + y; arr[15] = x - y;
x = arr[0]; y = arr[1]; arr[0] = x + y; arr[1] = x - y;
x = arr[2]; y = arr[3]; arr[2] = x + y; arr[3] = x - y;
x = arr[4]; y = arr[5]; arr[4] = x + y; arr[5] = x - y;
x = arr[6]; y = arr[7]; arr[6] = x + y; arr[7] = x - y;
x = arr[8]; y = arr[9]; arr[8] = x + y; arr[9] = x - y;
x = arr[10]; y = arr[11]; arr[10] = x + y; arr[11] = x - y;
x = arr[12]; y = arr[13]; arr[12] = x + y; arr[13] = x - y;
x = arr[14]; y = arr[15]; arr[14] = x + y; arr[15] = x - y;
}
/*
* This is a precalculated table for generating dot products with the random family of vectors directly.
* The first vector r_0 is expressed as a hashing function on the dimension index, and the other vectors
* are derived from r_0 using an FFT. The table is formed by precalculating the FFT of each basis vector.
*/
void lsh_setup_signtable(void)
{
int32 i,j;
int32 arr[16];
char *hibit0ptr;
char *hibit1ptr;
for(i=0;i<16;++i) { /* For each 4-bit position */
hibit0ptr = hash_signtable + i * 16;
hibit1ptr = hash_signtable + (i+16) * 16;
for(j=0;j<16;++j)
arr[j] = 0;
arr[ i ] = 1;
hash_int_fft_16(arr);
for(j=0;j<16;++j) {
if (arr[j] > 0) {
hibit0ptr[j] = '+';
hibit1ptr[j] = '-';
}
else {
hibit0ptr[j] = '-';
hibit1ptr[j] = '+';
}
}
}
}
/*
* Generate a dot product of the hash vector in -vec- with a random family of 16 vectors, { r }
* r_0 is a randomly generated set of +1 -1 coefficients across all the dimensions (indexed by uint32 vec[i].hash)
* The coefficient is calculated as a hashing function from the seed -hashcur- and the index (vec[i].hash),
* so it should be balanced between +1 and -1.
* All the other vectors are generated from an FFT of r_0. This allows the dotproduct with vec to be calculated
* using an FFT if -vec- has many non-zero coefficients. If -vec- has only a few non-zero coefficients,
* the dotproduct is calculated with each vector in the family directly for better efficiency.
* The resulting dotproducts are converted into a 16-long bitvector based on the sign of the dotproduct and
* placed in -bucket-
*/
static uint32 hash_16_dotproduct(uint32 bucket,LSH_ITEM *vec,uint32 vecsize,uint32 hashcur,uint32 vecsizeupper)
{
uint32 i,j;
uint32 rownum;
char *signptr;
double res[16];
for(i=0;i<16;++i)
res[i] = 0.0; /* Initialize the dotproduct results to zero */
if (vecsize < vecsizeupper) { /* If there are a small number of non-zero coefficients in -vec- */
for(i=0;i<vecsize;++i) {
rownum = vec[i].hash ^ hashcur; /* Calculate the rest of the r_0 hashing function*/
rownum = (rownum * 1103515245) + 12345;
rownum = (rownum>>24)&0x1f;
signptr = hash_signtable + rownum * 16;
for(j=0;j<16;++j) { /* Based on the precalculated coeff table calculate this portion of dotproduct */
if (signptr[j] == '+')
res[j] += vec[i].coeff; /* Dot product with +1 coeff */
else
res[j] -= vec[i].coeff; /* Dot product with -1 coeff */
}
}
}
else { /* If we have many non-zero coeffs in -vec- */
for(i=0;i<vecsize;++i) {
rownum = vec[i].hash ^ hashcur; /* Calculate the rest of the r_0 hashing function*/
rownum = (rownum * 1103515245) + 12345;
rownum = (rownum>>24)&0x1f;
if (rownum < 0x10) /* Set-up for the FFT */
res[rownum] += vec[i].coeff;
else
res[rownum&0xf] -= vec[i].coeff;
}
hash_double_fft_16(res); /* Calculate the remaining dotproducts by performing FFT */
}
for(i=0;i<16;++i) { /* Convert the dotproduct results to a bitvector */
bucket <<= 1;
if (res[i] > 0.0)
bucket |= 1;
}
return bucket;
}
void lsh_generate_binids(uint32 *res,LSH_ITEM *vec,uint32 vecsize)
{
uint32 bucket = 0;
int32 bucketcnt = 0;
int32 i,bitsleft;
uint32 curid;
uint32 mask,val;
uint32 hashbase = LSH_HASHBASE;
for(i=0;i<lsh_L;++i) {
curid = i; /* Tack-on bits that indicate the particular table this binid belongs to */
bitsleft = lsh_k;
do {
if (bucketcnt == 0) {
hashbase = (hashbase * 1103515245) + 12345;
bucket = hash_16_dotproduct(bucket,vec,vecsize,hashbase,5);
bucketcnt += 16;
}
if (bucketcnt >= bitsleft) {
curid <<= bitsleft;
mask = 1;
mask = (mask << bitsleft)-1;
val = bucket >> (bucketcnt - bitsleft);
curid |= (val & mask);
bucketcnt -= bitsleft;
bitsleft = 0;
}
else {
curid <<= bucketcnt;
mask = 1;
mask = (mask << bucketcnt)-1;
curid |= (bucket & mask);
bitsleft -= bucketcnt;
bucketcnt = 0;
}
} while(bitsleft > 0);
res[ i ] = curid;
}
}
void lsh_generate_binids_datum(Datum *res,LSH_ITEM *vec,uint32 vecsize)
{
uint32 bucket = 0;
int32 bucketcnt = 0;
int32 i,bitsleft;
uint32 curid;
uint32 mask,val;
uint32 hashbase = LSH_HASHBASE;
for(i=0;i<lsh_L;++i) {
curid = i; /* Tack-on bits that indicate the particular table this binid belongs to */
bitsleft = lsh_k;
do {
if (bucketcnt == 0) {
hashbase = (hashbase * 1103515245) + 12345;
bucket = hash_16_dotproduct(bucket,vec,vecsize,hashbase,5);
bucketcnt += 16;
}
if (bucketcnt >= bitsleft) {
curid <<= bitsleft;
mask = 1;
mask = (mask << bitsleft)-1;
val = bucket >> (bucketcnt - bitsleft);
curid |= (val & mask);
bucketcnt -= bitsleft;
bitsleft = 0;
}
else {
curid <<= bucketcnt;
mask = 1;
mask = (mask << bucketcnt)-1;
curid |= (bucket & mask);
bitsleft -= bucketcnt;
bucketcnt = 0;
}
} while(bitsleft > 0);
res[ i ] = Int32GetDatum((int32)curid);
}
}
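
The butterfly passes in hash_int_fft_16 and hash_double_fft_16 use only additions and subtractions, which is the shape of a 16-point Walsh-Hadamard transform (the source comments call it an FFT). The Python sketch below is not part of this commit. It is an illustrative re-implementation under that reading, and the names wht16, build_signtable, row_number, dot_direct, dot_transform and to_bits are hypothetical. It shows why the two branches of hash_16_dotproduct should agree: the sign table stores the transform of each basis vector, so summing table rows directly gives the same 16 values as scattering the coefficients and transforming once.

```python
MASK32 = 0xFFFFFFFF

def wht16(arr):
    """16-point butterfly using the same strides (8, 4, 2, 1) as hash_int_fft_16."""
    out = list(arr)
    stride = 8
    while stride >= 1:
        for base in range(0, 16, 2 * stride):
            for k in range(base, base + stride):
                x, y = out[k], out[k + stride]
                out[k], out[k + stride] = x + y, x - y
        stride //= 2
    return out

def build_signtable():
    """Rows 0..15: signs of the transform of basis vector e_i; rows 16..31: the negated rows."""
    rows = []
    for i in range(16):
        basis = [0] * 16
        basis[i] = 1
        rows.append(''.join('+' if v > 0 else '-' for v in wht16(basis)))
    flipped = [''.join('-' if c == '+' else '+' for c in row) for row in rows]
    return rows + flipped

def row_number(hash_value, seed):
    """5-bit row index derived from the item hash and the current seed (32-bit arithmetic)."""
    r = (hash_value ^ seed) & MASK32
    r = (r * 1103515245 + 12345) & MASK32
    return (r >> 24) & 0x1F

def dot_direct(items, seed, table):
    """Table-driven branch: add or subtract each coefficient per table sign."""
    res = [0.0] * 16
    for hash_value, coeff in items:
        row = table[row_number(hash_value, seed)]
        for j in range(16):
            res[j] += coeff if row[j] == '+' else -coeff
    return res

def dot_transform(items, seed):
    """Transform branch: scatter signed coefficients, then apply one butterfly."""
    res = [0.0] * 16
    for hash_value, coeff in items:
        r = row_number(hash_value, seed)
        if r < 0x10:
            res[r] += coeff
        else:
            res[r & 0xF] -= coeff
    return wht16(res)

def to_bits(bucket, res):
    """Shift one sign bit per dot product into a 32-bit bucket."""
    for v in res:
        bucket = (bucket << 1) & MASK32
        if v > 0.0:
            bucket |= 1
    return bucket
```

The coefficients used to exercise this sketch should be exactly representable in binary floating point (for example 0.5, 1.5, 2.25) if the two branches are compared with strict equality, since the branches add terms in different orders.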

@ -0,0 +1,101 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "lsh.h"
#define CRC_UPDATE(REG,VAL) (crc32tab[ ((REG) ^ (VAL)) & 0xff ] ^ ((REG) >> 8))
/* Table for bytewise calculation of a 32-bit Cyclic Redundancy Check */
uint32 crc32tab[] = {
0x0,0x77073096,0xee0e612c,0x990951ba,0x76dc419,0x706af48f,
0xe963a535,0x9e6495a3,0xedb8832,0x79dcb8a4,0xe0d5e91e,
0x97d2d988,0x9b64c2b,0x7eb17cbd,0xe7b82d07,0x90bf1d91,
0x1db71064,0x6ab020f2,0xf3b97148,0x84be41de,0x1adad47d,
0x6ddde4eb,0xf4d4b551,0x83d385c7,0x136c9856,0x646ba8c0,
0xfd62f97a,0x8a65c9ec,0x14015c4f,0x63066cd9,0xfa0f3d63,
0x8d080df5,0x3b6e20c8,0x4c69105e,0xd56041e4,0xa2677172,
0x3c03e4d1,0x4b04d447,0xd20d85fd,0xa50ab56b,0x35b5a8fa,
0x42b2986c,0xdbbbc9d6,0xacbcf940,0x32d86ce3,0x45df5c75,
0xdcd60dcf,0xabd13d59,0x26d930ac,0x51de003a,0xc8d75180,
0xbfd06116,0x21b4f4b5,0x56b3c423,0xcfba9599,0xb8bda50f,
0x2802b89e,0x5f058808,0xc60cd9b2,0xb10be924,0x2f6f7c87,
0x58684c11,0xc1611dab,0xb6662d3d,0x76dc4190,0x1db7106,
0x98d220bc,0xefd5102a,0x71b18589,0x6b6b51f,0x9fbfe4a5,
0xe8b8d433,0x7807c9a2,0xf00f934,0x9609a88e,0xe10e9818,
0x7f6a0dbb,0x86d3d2d,0x91646c97,0xe6635c01,0x6b6b51f4,
0x1c6c6162,0x856530d8,0xf262004e,0x6c0695ed,0x1b01a57b,
0x8208f4c1,0xf50fc457,0x65b0d9c6,0x12b7e950,0x8bbeb8ea,
0xfcb9887c,0x62dd1ddf,0x15da2d49,0x8cd37cf3,0xfbd44c65,
0x4db26158,0x3ab551ce,0xa3bc0074,0xd4bb30e2,0x4adfa541,
0x3dd895d7,0xa4d1c46d,0xd3d6f4fb,0x4369e96a,0x346ed9fc,
0xad678846,0xda60b8d0,0x44042d73,0x33031de5,0xaa0a4c5f,
0xdd0d7cc9,0x5005713c,0x270241aa,0xbe0b1010,0xc90c2086,
0x5768b525,0x206f85b3,0xb966d409,0xce61e49f,0x5edef90e,
0x29d9c998,0xb0d09822,0xc7d7a8b4,0x59b33d17,0x2eb40d81,
0xb7bd5c3b,0xc0ba6cad,0xedb88320,0x9abfb3b6,0x3b6e20c,
0x74b1d29a,0xead54739,0x9dd277af,0x4db2615,0x73dc1683,
0xe3630b12,0x94643b84,0xd6d6a3e,0x7a6a5aa8,0xe40ecf0b,
0x9309ff9d,0xa00ae27,0x7d079eb1,0xf00f9344,0x8708a3d2,
0x1e01f268,0x6906c2fe,0xf762575d,0x806567cb,0x196c3671,
0x6e6b06e7,0xfed41b76,0x89d32be0,0x10da7a5a,0x67dd4acc,
0xf9b9df6f,0x8ebeeff9,0x17b7be43,0x60b08ed5,0xd6d6a3e8,
0xa1d1937e,0x38d8c2c4,0x4fdff252,0xd1bb67f1,0xa6bc5767,
0x3fb506dd,0x48b2364b,0xd80d2bda,0xaf0a1b4c,0x36034af6,
0x41047a60,0xdf60efc3,0xa867df55,0x316e8eef,0x4669be79,
0xcb61b38c,0xbc66831a,0x256fd2a0,0x5268e236,0xcc0c7795,
0xbb0b4703,0x220216b9,0x5505262f,0xc5ba3bbe,0xb2bd0b28,
0x2bb45a92,0x5cb36a04,0xc2d7ffa7,0xb5d0cf31,0x2cd99e8b,
0x5bdeae1d,0x9b64c2b0,0xec63f226,0x756aa39c,0x26d930a,
0x9c0906a9,0xeb0e363f,0x72076785,0x5005713,0x95bf4a82,
0xe2b87a14,0x7bb12bae,0xcb61b38,0x92d28e9b,0xe5d5be0d,
0x7cdcefb7,0xbdbdf21,0x86d3d2d4,0xf1d4e242,0x68ddb3f8,
0x1fda836e,0x81be16cd,0xf6b9265b,0x6fb077e1,0x18b74777,
0x88085ae6,0xff0f6a70,0x66063bca,0x11010b5c,0x8f659eff,
0xf862ae69,0x616bffd3,0x166ccf45,0xa00ae278,0xd70dd2ee,
0x4e048354,0x3903b3c2,0xa7672661,0xd06016f7,0x4969474d,
0x3e6e77db,0xaed16a4a,0xd9d65adc,0x40df0b66,0x37d83bf0,
0xa9bcae53,0xdebb9ec5,0x47b2cf7f,0x30b5ffe9,0xbdbdf21c,
0xcabac28a,0x53b39330,0x24b4a3a6,0xbad03605,0xcdd70693,
0x54de5729,0x23d967bf,0xb3667a2e,0xc4614ab8,0x5d681b02,
0x2a6f2b94,0xb40bbe37,0xc30c8ea1,0x5a05df1b,0x2d02ef8d };
uint64 lsh_hash_internal(LSHVECTOR *vec)
{
uint32 reg1,reg2;
uint32 curtf,curhash,oldreg1;
uint32 i;
uint64 res;
reg1 = 0x12CF93AB;
reg2 = 0xEE39B2D6;
for(i=0;i<vec->numitems;++i) {
curtf = vec->items[i].tf;
curhash = vec->items[i].hash;
oldreg1 = reg1;
reg1 = CRC_UPDATE(reg1,curtf);
reg1 = CRC_UPDATE(reg1,curhash);
reg1 = CRC_UPDATE(reg1,(reg2>>24));
reg2 = CRC_UPDATE(reg2,(oldreg1>>24));
reg2 = CRC_UPDATE(reg2,(curhash>>8));
reg2 = CRC_UPDATE(reg2,(curhash>>16));
reg2 = CRC_UPDATE(reg2,(curhash>>24));
}
res = reg1;
res <<= 32;
res |= reg2;
return res;
}
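
The crc32tab values above match the widely used byte-wise CRC-32 table for the reflected polynomial 0xEDB88320. The Python sketch below is not part of this commit. It regenerates the table from that polynomial and mirrors the two-register mixing of lsh_hash_internal, taking each item as a hypothetical (tf, hash) pair. The names make_crc32_table, crc_update and lsh_hash_internal_py are assumptions made for illustration.

```python
def make_crc32_table():
    """Build the 256-entry byte-wise CRC-32 table (reflected polynomial 0xEDB88320)."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table

CRC32_TABLE = make_crc32_table()

def crc_update(reg, val):
    """Same shape as the CRC_UPDATE macro: fold the low byte of val into reg."""
    return CRC32_TABLE[(reg ^ val) & 0xFF] ^ (reg >> 8)

def lsh_hash_internal_py(items):
    """Mix (tf, hash) pairs into two 32-bit registers and return them as one 64-bit value."""
    reg1 = 0x12CF93AB
    reg2 = 0xEE39B2D6
    for tf, hash_value in items:
        oldreg1 = reg1
        reg1 = crc_update(reg1, tf)
        reg1 = crc_update(reg1, hash_value)
        reg1 = crc_update(reg1, reg2 >> 24)
        reg2 = crc_update(reg2, oldreg1 >> 24)
        reg2 = crc_update(reg2, hash_value >> 8)
        reg2 = crc_update(reg2, hash_value >> 16)
        reg2 = crc_update(reg2, hash_value >> 24)
    return (reg1 << 32) | reg2
```

For an empty item list the result is just the two initial register constants packed together, which gives a quick sanity check when porting the routine.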

@ -0,0 +1,414 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "lsh.h"
#include "fmgr.h"
#include "funcapi.h"
#include "access/htup_details.h"
#include "access/gin.h"
#include "libpq/pqformat.h"
#include <ctype.h>
PG_MODULE_MAGIC;
void _PG_init(void);
PG_FUNCTION_INFO_V1(lshvector_in);
PG_FUNCTION_INFO_V1(lshvector_out);
PG_FUNCTION_INFO_V1(lshvector_send);
PG_FUNCTION_INFO_V1(lshvector_recv);
PG_FUNCTION_INFO_V1(lshvector_hash);
PG_FUNCTION_INFO_V1(lshvector_compare);
PG_FUNCTION_INFO_V1(lshvector_overlap);
PG_FUNCTION_INFO_V1(lshvector_gin_extract_value);
PG_FUNCTION_INFO_V1(lshvector_gin_extract_query);
PG_FUNCTION_INFO_V1(lshvector_gin_consistent);
PG_FUNCTION_INFO_V1(lsh_load);
PG_FUNCTION_INFO_V1(lsh_reload);
PG_FUNCTION_INFO_V1(lsh_getweight);
Datum lshvector_in(PG_FUNCTION_ARGS);
Datum lshvector_out(PG_FUNCTION_ARGS);
Datum lshvector_send(PG_FUNCTION_ARGS);
Datum lshvector_recv(PG_FUNCTION_ARGS);
Datum lshvector_hash(PG_FUNCTION_ARGS);
Datum lshvector_compare(PG_FUNCTION_ARGS);
Datum lshvector_overlap(PG_FUNCTION_ARGS);
Datum lshvector_gin_extract_value(PG_FUNCTION_ARGS);
Datum lshvector_gin_extract_query(PG_FUNCTION_ARGS);
Datum lshvector_gin_consistent(PG_FUNCTION_ARGS);
Datum lsh_load(PG_FUNCTION_ARGS);
Datum lsh_reload(PG_FUNCTION_ARGS);
Datum lsh_getweight(PG_FUNCTION_ARGS);
/*
* Allocate memory for an LSHVECTOR given the raw count of the number of hash entries in the vector
*/
static LSHVECTOR *allocate_lshvector(uint32 numentries)
{
LSHVECTOR *out;
uint32 maxitems, commonlen;
/* Maximum number of hashes in a single LSHVECTOR assuming a 1 gigabyte allocation limit */
maxitems = (0x3fffffff - HDRSIZELSH) / sizeof(LSH_ITEM);
if (numentries > maxitems) {
ereport(ERROR,(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),errmsg("Exceeded maximum entries for single lshvector")));
/* Does not return */
}
commonlen = HDRSIZELSH + numentries * sizeof(LSH_ITEM);
out = (LSHVECTOR *) palloc(commonlen);
SET_VARSIZE(out,commonlen);
return out;
}
void _PG_init(void)
{
lsh_initialize();
}
Datum lsh_load(PG_FUNCTION_ARGS)
{
if (!weights_loaded) {
lsh_load_weights();
lsh_load_lookuptable();
lsh_load_binconfig();
weights_loaded = true;
}
PG_RETURN_INT32(0);
}
Datum lsh_reload(PG_FUNCTION_ARGS)
{
lsh_load_weights();
lsh_load_lookuptable();
lsh_load_binconfig();
weights_loaded = true;
PG_RETURN_INT32(0);
}
Datum lsh_getweight(PG_FUNCTION_ARGS)
{
LSHVECTOR *vec = PG_GETARG_LSHVECTOR_P(0);
uint32 arg = PG_GETARG_UINT32(1);
double res;
if (arg >= vec->numitems)
res = 0.0;
else
res = vec->items[arg].coeff;
PG_FREE_IF_COPY(vec,0);
PG_RETURN_FLOAT8( res );
}
/*
* text input
*/
Datum
lshvector_in(PG_FUNCTION_ARGS)
{
char *buf = (char *) PG_GETARG_POINTER(0);
char *ptr,*ptrstart;
LSHVECTOR *vec;
uint32 numitems = 0;
uint32 commacount = 0;
uint32 i,j;
int32 val;
char curc;
ptr = buf;
curc = '\0';
while(*ptr) {
curc = *ptr;
if (isspace(curc)==0) break;
++ptr;
}
if (curc != '(')
ereport(ERROR,(errcode(ERRCODE_SYNTAX_ERROR),errmsg("Missing opening '('"))); /* Does not return */
++ptr;
ptrstart = ptr;
while (*ptr) {
curc = *ptr;
if (curc == ':')
numitems += 1;
else if (curc == ',')
commacount += 1;
else if (curc == ')')
break;
++ptr;
}
if ((curc != ')')||(numitems != commacount+1))
ereport(ERROR,(errcode(ERRCODE_SYNTAX_ERROR),errmsg("Bad delimiters"))); /* Does not return */
vec = allocate_lshvector(numitems);
ptr = ptrstart;
i = 0;
j = 0;
while(*ptr) {
val = strtol(ptr,&ptr,16);
if (j==0) {
if ((val<1)||(val>64)) {
pfree(vec);
ereport(ERROR,(errcode(ERRCODE_SYNTAX_ERROR),errmsg("Term frequency count out of bounds"))); /* Does not return */
}
vec->items[i].tf = (uint16)val;
j = 1;
}
else {
vec->items[i].hash = (uint32)val;
vec->items[i].idf = 0;
j = 0;
i += 1;
}
while(isspace( *ptr ))
ptr++;
if (*ptr == ')') break;
if (*ptr == ':') {
if (j==0) {
pfree(vec);
ereport(ERROR,(errcode(ERRCODE_SYNTAX_ERROR),errmsg("Expected ','"))); /* Does not return */
}
ptr++;
}
else if (*ptr == ',') {
if (j==1) {
pfree(vec);
ereport(ERROR,(errcode(ERRCODE_SYNTAX_ERROR),errmsg("Expected ':'"))); /* Does not return */
}
ptr++;
}
}
vec->numitems = numitems;
lsh_calc_weights(vec);
PG_RETURN_POINTER(vec);
}
/*
* text output
*/
Datum
lshvector_out(PG_FUNCTION_ARGS)
{
LSHVECTOR *vec = PG_GETARG_LSHVECTOR_P(0);
StringInfoData buf;
uint32 i,sz;
initStringInfo(&buf);
appendStringInfoChar(&buf,'(');
sz = vec->numitems;
for(i=0;i<sz;++i) {
appendStringInfo(&buf,"%x",(int32)vec->items[i].tf);
appendStringInfoChar(&buf,':');
appendStringInfo(&buf,"%x",(int32)vec->items[i].hash);
if (i+1 < sz)
appendStringInfoChar(&buf,',');
}
appendStringInfoChar(&buf,')');
PG_FREE_IF_COPY(vec,0);
PG_RETURN_CSTRING(buf.data);
}
/*
* binary output
*/
Datum
lshvector_send(PG_FUNCTION_ARGS)
{
LSHVECTOR *vec = PG_GETARG_LSHVECTOR_P(0);
uint32 i;
uint32 numitems;
StringInfoData buf;
numitems = vec->numitems;
pq_begintypsend(&buf);
pq_sendint(&buf,numitems,4);
for(i=0;i<numitems;++i) {
pq_sendint(&buf,vec->items[i].tf,1);
pq_sendint(&buf,vec->items[i].hash,4);
}
PG_FREE_IF_COPY(vec,0);
PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}
/*
* binary input
*/
Datum
lshvector_recv(PG_FUNCTION_ARGS)
{
LSHVECTOR *out;
StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
uint32 numitems;
uint32 tf;
uint32 i;
numitems = pq_getmsgint(buf,4);
out = allocate_lshvector(numitems);
out->numitems = numitems;
for(i=0;i<numitems;++i) {
tf = pq_getmsgint(buf,1);
if ((tf<1)||(tf>64)) {
pfree(out);
ereport(ERROR,(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),errmsg("Term frequency is out of range")));
/* Does not return */
}
out->items[i].tf = tf;
out->items[i].hash = pq_getmsgint(buf,4);
}
lsh_calc_weights(out);
PG_RETURN_POINTER(out);
}
Datum lshvector_hash(PG_FUNCTION_ARGS)
{
LSHVECTOR *a = PG_GETARG_LSHVECTOR_P(0);
int64 res = (int64)lsh_hash_internal(a);
PG_FREE_IF_COPY(a,0);
PG_RETURN_INT64(res);
}
Datum lshvector_compare(PG_FUNCTION_ARGS)
{
LSHVECTOR *a = PG_GETARG_LSHVECTOR_P(0);
LSHVECTOR *b = PG_GETARG_LSHVECTOR_P(1);
TupleDesc tupdesc;
TupleDesc bless;
HeapTuple restuple;
Datum dvalues[2];
bool nulls[2] = {false, false};
double sim,sig;
sim = lsh_compare_internal(a,b,&sig);
PG_FREE_IF_COPY(a,0);
PG_FREE_IF_COPY(b,1);
if (get_call_result_type(fcinfo,NULL,&tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR,"Could not get composite row type to return");
bless = BlessTupleDesc(tupdesc);
dvalues[0] = Float8GetDatum(sim);
dvalues[1] = Float8GetDatum(sig);
restuple = heap_form_tuple(bless,dvalues,nulls);
return HeapTupleGetDatum(restuple);
}
/*
* This is the actual operator function being accelerated by the GIN index. In truth, the index itself
* defines the operator, so the commented-out code below emulates the index's key generation process and
* looks for overlap in the keys between two vectors. In practice, any query that invokes this operator
* is expected to go through the index and so doesn't need to evaluate this function. In the
* cases where PostgreSQL does a recheck after going through the index, every query sends
* the results of the operator test to a similarity filter anyway, so there is no reason to actually perform
* the overlap test. We therefore implement a NOP that always returns true.
*/
Datum lshvector_overlap(PG_FUNCTION_ARGS)
{
/* bool res; */
/* int32 i; */
/* LSHVECTOR *a = PG_GETARG_LSHVECTOR_P(0); */
/* LSHVECTOR *b = PG_GETARG_LSHVECTOR_P(1); */
/* uint32 *bina = (uint32 *)palloc( sizeof(uint32) * lsh_L ); */
/* uint32 *binb = (uint32 *)palloc( sizeof(uint32) * lsh_L ); */
/* lsh_generate_binids(bina,a->items,a->numitems); */
/* lsh_generate_binids(binb,b->items,b->numitems); */
/* PG_FREE_IF_COPY(a,0); */
/* PG_FREE_IF_COPY(b,1); */
/* res = false; /\* Assume no overlap *\/ */
/* for(i=0;i<lsh_L;++i) { */
/* if (bina[i] == binb[i]) { */
/* res = true; /\* We found an overlap, (only need one) *\/ */
/* break; */
/* } */
/* } */
/* pfree(bina); */
/* pfree(binb); */
PG_RETURN_BOOL(true);
}
Datum lshvector_gin_extract_value(PG_FUNCTION_ARGS)
{
LSHVECTOR *a = PG_GETARG_LSHVECTOR_P(0);
int32 *nkeys = (int32 *) PG_GETARG_POINTER(1);
Datum *entries = (Datum *)palloc( sizeof(Datum) * lsh_L );
lsh_generate_binids_datum(entries,a->items,a->numitems);
PG_FREE_IF_COPY(a,0);
*nkeys = lsh_L;
PG_RETURN_POINTER(entries);
}
Datum lshvector_gin_extract_query(PG_FUNCTION_ARGS)
{
LSHVECTOR *a = PG_GETARG_LSHVECTOR_P(0);
int32 *nkeys = (int32 *) PG_GETARG_POINTER(1);
/* StrategyNumber strategy = PG_GETARG_UINT16(2); */
/* bool **pmatch = (bool **) PG_GETARG_POINTER(3); */
/* Pointer **extra_data = (Pointer **) PG_GETARG_POINTER(4); */
/* bool **nullFlags = (bool **) PG_GETARG_POINTER(5); */
/* int32 *searchMode = (int32 *) PG_GETARG_POINTER(6); */
Datum *entries = (Datum *)palloc( sizeof(Datum) * lsh_L );
lsh_generate_binids_datum(entries,a->items,a->numitems);
PG_FREE_IF_COPY(a,0);
*nkeys = lsh_L;
PG_RETURN_POINTER(entries);
}
Datum lshvector_gin_consistent(PG_FUNCTION_ARGS)
{
bool *check = (bool *) PG_GETARG_POINTER(0);
/* StrategyNumber strategy = PG_GETARG_UINT16(1); */
/* LSHVECTOR *a = PG_GETARG_LSHVECTOR_P(2); */
int32 nkeys = PG_GETARG_INT32(3);
/* Pointer *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
bool *recheck = (bool *) PG_GETARG_POINTER(5);
bool res = false;
int32 i;
*recheck = false; /* The operator does NOT need to be recalculated; this routine should match exactly */
for(i=0;i<nkeys;++i) {
if (check[i]) { /* If ANY hash is present in the indexed lshvector */
res = true; /* this is considered an overlap */
break; /* and we don't need to look any further */
}
}
PG_RETURN_BOOL(res);
}
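Note on the text format: lshvector_in and lshvector_out above exchange vectors as (tf:hash,tf:hash,...), with both fields written in hexadecimal and tf limited to the range 1..64. The block below is a minimal standalone sketch of that grammar, not the extension's code. It drops the PostgreSQL dependencies, the names ParsedItem and sketch_parse_lshvector are hypothetical, and it is somewhat stricter than lshvector_in (for example, it does not allow whitespace before ':').

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for LSH_ITEM; only the two parsed fields are kept. */
typedef struct {
    uint32_t hash;   /* feature hash */
    uint16_t tf;     /* term frequency, 1..64 */
} ParsedItem;

/*
 * Parse "(tf:hash,tf:hash,...)" where both fields are hexadecimal.
 * Returns the number of items stored in out, or -1 on a syntax error,
 * an out-of-range tf, or more than maxitems pairs.
 */
int sketch_parse_lshvector(const char *text, ParsedItem *out, int maxitems)
{
    const char *ptr = text;
    char *end;
    int count = 0;

    assert(text != NULL && out != NULL);
    while (isspace((unsigned char)*ptr))    /* skip leading whitespace */
        ptr++;
    if (*ptr != '(')
        return -1;
    ptr++;
    for (;;) {
        unsigned long tf, hash;

        if (count >= maxitems)
            return -1;
        tf = strtoul(ptr, &end, 16);        /* term frequency */
        if (end == ptr || tf < 1 || tf > 64)
            return -1;
        ptr = end;
        if (*ptr != ':')
            return -1;
        ptr++;
        hash = strtoul(ptr, &end, 16);      /* 32-bit feature hash */
        if (end == ptr || hash > 0xffffffffUL)
            return -1;
        ptr = end;
        out[count].tf = (uint16_t)tf;
        out[count].hash = (uint32_t)hash;
        count++;
        while (isspace((unsigned char)*ptr))
            ptr++;
        if (*ptr == ')')                    /* end of vector */
            return count;
        if (*ptr != ',')
            return -1;
        ptr++;
    }
}
```

With this sketch, the input (1:a3f,2:ff00) yields two items: tf 1 with hash 0xa3f, and tf 2 with hash 0xff00. An input such as (41:1) is rejected because 0x41 is 65, which is above the tf limit of 64.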

View File

@ -0,0 +1,60 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifndef __LSH_H__
#define __LSH_H__
#include "postgres.h"
typedef struct
{
uint32 hash; /* A specific hash */
uint16 tf; /* Associated hash(term) frequency */
uint16 idf; /* Inverse Document Frequency */
double coeff; /* The actual weight of this hash as a coefficient */
} LSH_ITEM;
typedef struct
{
int32 vl_len_; /* varlena header (do not touch directly!) */
uint32 numitems;
uint32 hashcount; /* Total number of hashes counting multiplicity */
double length; /* Length of vector */
LSH_ITEM items[1];
} LSHVECTOR;
#define HDRSIZELSH offsetof(LSHVECTOR,items)
#define DatumGetLshVectorP(X) ((LSHVECTOR *) PG_DETOAST_DATUM(X))
#define PG_GETARG_LSHVECTOR_P(n) DatumGetLshVectorP(PG_GETARG_DATUM(n))
extern int32 lsh_k;
extern int32 lsh_L;
extern uint32 crc32tab[];
extern bool weights_loaded;
extern void lsh_calc_weights(LSHVECTOR *vec);
extern void lsh_initialize(void);
extern void lsh_load_weights(void);
extern void lsh_load_lookuptable(void);
extern uint64 lsh_hash_internal(LSHVECTOR *vec);
extern double lsh_compare_internal(LSHVECTOR *a,LSHVECTOR *b,double *sig);
extern void lsh_setup_signtable(void);
extern void lsh_load_binconfig(void);
extern void lsh_generate_binids(uint32 *res,LSH_ITEM *vec,uint32 vecsize);
extern void lsh_generate_binids_datum(Datum *res,LSH_ITEM *vec,uint32 vecsize);
#endif
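Note on sizing: HDRSIZELSH and LSH_ITEM above fix how many bytes allocate_lshvector requests for a given item count, and how many items fit under its allocation limit of 0x3fffffff bytes. The following standalone sketch reproduces that arithmetic. It assumes the stdint.h types are equivalent to PostgreSQL's uint32/uint16/int32 typedefs, and the names SizedItem, SizedVector, sketch_max_items and sketch_vector_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of LSH_ITEM using <stdint.h> types (an assumption, see lead-in). */
typedef struct {
    uint32_t hash;
    uint16_t tf;
    uint16_t idf;
    double coeff;
} SizedItem;

/* Mirror of LSHVECTOR; items[1] marks the start of the variable-length tail. */
typedef struct {
    int32_t vl_len_;
    uint32_t numitems;
    uint32_t hashcount;
    double length;
    SizedItem items[1];
} SizedVector;

#define SKETCH_HDRSIZE offsetof(SizedVector, items)
#define SKETCH_ALLOC_LIMIT 0x3fffffffu   /* same limit as allocate_lshvector */

/* Largest item count whose allocation stays within the limit. */
size_t sketch_max_items(void)
{
    assert(SKETCH_HDRSIZE < SKETCH_ALLOC_LIMIT);
    return (SKETCH_ALLOC_LIMIT - SKETCH_HDRSIZE) / sizeof(SizedItem);
}

/* Bytes needed for numentries items, or 0 if that would exceed the limit. */
size_t sketch_vector_size(size_t numentries)
{
    if (numentries > sketch_max_items())
        return 0;
    return SKETCH_HDRSIZE + numentries * sizeof(SizedItem);
}
```

Because the item count is checked against the maximum before the multiplication, the size computation cannot wrap around, which is the same ordering allocate_lshvector relies on.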

View File

@ -0,0 +1,476 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "lsh.h"
#include "fmgr.h"
#include "executor/spi.h"
#include "utils/memutils.h"
#include <math.h>
#define LSH_IDFSIZE 512
#define LSH_TFSIZE 64
#define LSH_MAX_HASHENTRIES 1048576
#define LSH_MAX_K 31
#define LSH_MAX_L 1024
#define LSH_DEFAULT_K 17
#define LSH_DEFAULT_L 146
int32 lsh_k; /* Number of bits in a binid */
int32 lsh_L; /* Number of binnings */
static double lsh_idfweight[LSH_IDFSIZE]; /* Sorted weights least -> most probable for Inverse Document Freq */
static double lsh_tfweight[LSH_TFSIZE]; /* Sorted weights least -> most probable for Term Frequency */
static double lsh_weightnorm; /* Normalization of idf weights over raw log(probability) */
static double lsh_probflip0; /* Significance penalty for hash flips */
static double lsh_probflip1;
static double lsh_probdiff0; /* Significance penalty for length differences */
static double lsh_probdiff1;
static double lsh_scale; /* Final scaling for significance scoring */
static double lsh_addend;
static double lsh_probflip0_norm;
static double lsh_probflip1_norm;
static double lsh_probdiff0_norm;
static double lsh_probdiff1_norm;
typedef struct {
uint32 hash;
uint32 count;
} IDFEntry;
static MemoryContext lsh_mem_ctx;
static uint32 lsh_IDFTableMask; /* mask for hash table computation */
static IDFEntry *lsh_IDFTable = NULL; /* The IDFLookup table */
bool weights_loaded = false;
static void update_norms(void)
{
int32 i;
double scale_sqrt = sqrt(lsh_scale);
lsh_probflip0_norm = lsh_probflip0 * lsh_scale;
lsh_probflip1_norm = lsh_probflip1 * lsh_scale;
lsh_probdiff0_norm = lsh_probdiff0 * lsh_scale;
lsh_probdiff1_norm = lsh_probdiff1 * lsh_scale;
lsh_weightnorm = lsh_weightnorm / lsh_scale;
for(i=0;i<LSH_IDFSIZE;++i) {
lsh_idfweight[i] *= scale_sqrt;
}
}
/*
* Load the IDF and TF weights and other scaling info from the table 'weighttable'
* If the table isn't present, return false
* This assumes the existence of a table with LSH_IDFSIZE + LSH_TFSIZE + 7 rows constructed with
* CREATE TABLE weighttable (id integer,weight double precision);
*/
static bool load_weights_from_table(void)
{
SPITupleTable *spi_tuptable;
TupleDesc spi_tupdesc;
uint64 i,proc;
int32 ret;
char *resstring;
int32 resindex;
double resweight;
ret = SPI_connect();
if (ret < 0)
elog(ERROR,"lshvector load_weights_from_table: SPI_connect returned %d",ret);
/* Check for the existence of weighttable */
ret = SPI_execute("SELECT relname from pg_class where relname='weighttable';",true,0);
proc = SPI_processed;
if ((ret != SPI_OK_SELECT)||(proc != 1)) {
elog(WARNING,"lshvector load_weights_from_table: weighttable not present - using default weights");
SPI_finish();
return false;
}
ret = SPI_execute("SELECT ALL * from weighttable;",true,0); /* Read(only) all rows from table */
proc = SPI_processed;
if ((ret != SPI_OK_SELECT)||(proc != (LSH_IDFSIZE+LSH_TFSIZE + 7))) {
elog(WARNING,"lshvector load_weights_from_table: weighttable has incorrect length - reverting to default weights");
SPI_finish();
return false;
}
spi_tupdesc = SPI_tuptable->tupdesc;
spi_tuptable = SPI_tuptable;
for(i=0;i<proc;++i) {
HeapTuple tuple = spi_tuptable->vals[i];
resstring = SPI_getvalue(tuple, spi_tupdesc, 1); /* Column numbers start at 1 */
resindex = strtol(resstring,NULL,10);
pfree(resstring);
resstring = SPI_getvalue(tuple, spi_tupdesc, 2);
resweight = atof( resstring );
pfree(resstring);
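if (resindex < 0) { /* A negative id would index before the start of the weight arrays */
SPI_finish();
return false;
}
else if (resindex < LSH_IDFSIZE)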
lsh_idfweight[resindex] = resweight;
else if (resindex < LSH_IDFSIZE + LSH_TFSIZE)
lsh_tfweight[resindex - LSH_IDFSIZE] = resweight;
else if (resindex == (LSH_IDFSIZE + LSH_TFSIZE))
lsh_weightnorm = resweight;
else if (resindex == (LSH_IDFSIZE + LSH_TFSIZE + 1))
lsh_probflip0 = resweight;
else if (resindex == (LSH_IDFSIZE + LSH_TFSIZE + 2))
lsh_probflip1 = resweight;
else if (resindex == (LSH_IDFSIZE + LSH_TFSIZE + 3))
lsh_probdiff0 = resweight;
else if (resindex == (LSH_IDFSIZE + LSH_TFSIZE + 4))
lsh_probdiff1 = resweight;
else if (resindex == (LSH_IDFSIZE + LSH_TFSIZE + 5))
lsh_scale = resweight;
else if (resindex == (LSH_IDFSIZE + LSH_TFSIZE + 6))
lsh_addend = resweight;
else {
SPI_finish();
return false;
}
}
SPI_finish();
update_norms();
return true;
}
void lsh_load_weights(void)
{
int32 i;
if (load_weights_from_table()) /* Try to get weights from table */
return;
/* Provide some sort of reasonable default */
for(i=0;i<LSH_IDFSIZE;++i)
lsh_idfweight[i] = 1.0;
for(i=0;i<LSH_TFSIZE;++i)
lsh_tfweight[i] = 1.0;
lsh_weightnorm = 13.0;
lsh_probflip0 = 0.2;
lsh_probflip1 = 20.0;
lsh_probdiff0 = 0.2;
lsh_probdiff1 = 20.0;
lsh_scale = 1.0;
lsh_addend = 0.0;
update_norms();
}
static void initialize_idflookup_hashtable(uint32 size)
{
uint32 i;
MemoryContext oldctx;
lsh_IDFTableMask = 1;
while( lsh_IDFTableMask < size )
lsh_IDFTableMask <<= 1;
lsh_IDFTableMask <<= 1;
oldctx = MemoryContextSwitchTo(lsh_mem_ctx);
lsh_IDFTable = (IDFEntry *) palloc(sizeof(IDFEntry) * lsh_IDFTableMask);
for(i=0;i<lsh_IDFTableMask;++i) {
lsh_IDFTable[i].count = 0xffffffff; /* Mark all the slots as empty */
}
lsh_IDFTableMask -= 1;
MemoryContextSwitchTo(oldctx);
}
static void insert_idflookup_hash(uint32 hash,uint32 count)
{
IDFEntry *ptr;
uint32 val = hash & lsh_IDFTableMask;
for(;;) {
ptr = lsh_IDFTable + val;
if (ptr->count == 0xffffffff) /* Found an empty slot */
break;
val = (val + 1) & lsh_IDFTableMask;
}
ptr->hash = hash;
ptr->count = count;
}
static uint32 get_idflookup_count(uint32 hash)
{
uint32 val;
IDFEntry *ptr;
if (lsh_IDFTableMask == 0)
return 0;
val = hash & lsh_IDFTableMask;
for(;;) {
ptr = lsh_IDFTable + val;
if (ptr->count == 0xffffffff) break; /* Is slot empty */
if (ptr->hash == hash)
return ptr->count;
val = (val + 1) & lsh_IDFTableMask;
}
return 0; /* Entry is not in the table (assume 0 count) */
}
/*
* Based on hash and existing idf and tf counts, calculate the final coefficient
* Also calculate the vector length and hashcount
*/
void lsh_calc_weights(LSHVECTOR *vec)
{
uint32 i;
LSH_ITEM *ptr;
uint32 idf;
double length = 0.0;
double coeff;
uint32 tf;
uint32 hashcount = 0;
ptr = vec->items;
for(i=0;i<vec->numitems;++i) {
idf = get_idflookup_count(ptr[i].hash);
ptr[i].idf = idf;
tf = ptr[i].tf;
coeff = lsh_idfweight[idf] * lsh_tfweight[ tf - 1 ];
ptr[i].coeff = coeff;
length += coeff * coeff;
hashcount += tf;
}
vec->length = sqrt(length);
vec->hashcount = hashcount;
}
/* Load the most common IDF hashes for lookup and weight generation from the table 'idflookup'
* If the table isn't present, return false
* This assumes the existence of a table with (approximately) 1000 rows constructed with
* CREATE TABLE idflookup( hash bigint, lookup integer);
*/
static bool load_idflookup_from_table(void)
{
SPITupleTable *spi_tuptable;
TupleDesc spi_tupdesc;
uint64 i,proc;
int32 ret;
char *resstring;
uint32 rescount;
uint32 reshash;
ret = SPI_connect();
if (ret < 0)
elog(ERROR,"lshvector load_idflookup_from_table: SPI_connect returned %d",ret);
/* Check for the existence of idflookup */
ret = SPI_execute("SELECT relname from pg_class where relname='idflookup';",true,0);
proc = SPI_processed;
if ((ret != SPI_OK_SELECT)||(proc != 1)) {
elog(WARNING,"lshvector load_idflookup_from_table: No IDF hashes present");
SPI_finish();
return false;
}
ret = SPI_execute("SELECT ALL * from idflookup;",true,0); /* Read(only) all rows from table */
proc = SPI_processed;
if ((ret != SPI_OK_SELECT)||(proc <= 1)||(proc > LSH_MAX_HASHENTRIES)) {
elog(WARNING,"lshvector load_idflookup_from_table: idflookup has invalid size: IDF hashes not loaded");
SPI_finish();
return false;
}
initialize_idflookup_hashtable((uint32)proc); /* Allocate the hashtable to hold entries for each row */
spi_tupdesc = SPI_tuptable->tupdesc;
spi_tuptable = SPI_tuptable;
for(i=0;i<proc;++i) {
HeapTuple tuple = spi_tuptable->vals[i];
resstring = SPI_getvalue(tuple, spi_tupdesc, 1); /* Column numbers start at 1 */
reshash = strtoul(resstring,NULL,10);
pfree(resstring);
resstring = SPI_getvalue(tuple, spi_tupdesc, 2);
rescount = strtoul(resstring,NULL,10);
pfree(resstring);
insert_idflookup_hash(reshash,rescount);
}
SPI_finish();
return true;
}
void lsh_load_binconfig(void)
{ /* Load the k and L parameters from the database */
SPITupleTable *spi_tuptable;
TupleDesc spi_tupdesc;
uint64 proc;
int32 ret;
char *resstring;
HeapTuple tuple;
ret = SPI_connect();
if (ret < 0)
elog(ERROR,"lshvector lsh_load_binconfig: SPI_connect returned %d",ret);
/* Check for the existence of keyvaluetable */
ret = SPI_execute("SELECT relname from pg_class where relname='keyvaluetable';",true,0);
proc = SPI_processed;
if ((ret != SPI_OK_SELECT)||(proc != 1)) {
SPI_finish();
lsh_k = LSH_DEFAULT_K; /* Reasonable defaults if configuration parameters don't exist */
lsh_L = LSH_DEFAULT_L;
return;
}
/* Get the 'k' value */
ret = SPI_execute("SELECT value FROM keyvaluetable WHERE key='k';",true,0);
proc = SPI_processed;
if ((ret != SPI_OK_SELECT)||(proc != 1))
elog(ERROR,"lshvector lsh_load_binconfig: Could not load 'k' value from keyvaluetable");
spi_tupdesc = SPI_tuptable->tupdesc;
spi_tuptable = SPI_tuptable;
tuple = spi_tuptable->vals[0];
resstring = SPI_getvalue(tuple,spi_tupdesc, 1); /* First column */
lsh_k = strtoul(resstring,NULL,10);
pfree(resstring);
/* Get the 'L' value */
ret = SPI_execute("SELECT value FROM keyvaluetable WHERE key='L';",true,0);
proc = SPI_processed;
if ((ret != SPI_OK_SELECT)||(proc != 1))
elog(ERROR,"lshvector lsh_load_binconfig: Could not load 'L' value from keyvaluetable");
spi_tupdesc = SPI_tuptable->tupdesc;
spi_tuptable = SPI_tuptable;
tuple = spi_tuptable->vals[0];
resstring = SPI_getvalue(tuple,spi_tupdesc, 1); /* First column */
lsh_L = strtoul(resstring,NULL,10);
pfree(resstring);
SPI_finish();
if (lsh_k < 1 || lsh_k > LSH_MAX_K || lsh_L < 1 || lsh_L > LSH_MAX_L)
elog(ERROR,"lshvector lsh_load_binconfig: Invalid k and L settings");
}
void lsh_load_lookuptable(void)
{
if (lsh_IDFTable != NULL) {
pfree(lsh_IDFTable);
lsh_IDFTable = NULL;
}
if (load_idflookup_from_table())
return;
if (lsh_IDFTable != NULL) {
pfree(lsh_IDFTable);
lsh_IDFTable = NULL;
}
lsh_IDFTableMask = 0; /* Default lookup, always return 0 */
}
/* Initialize the weight system, the first time the extension is loaded */
void lsh_initialize(void)
{
lsh_mem_ctx = AllocSetContextCreate(TopMemoryContext,
"IDF weights lookup table",
ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE,
ALLOCSET_DEFAULT_MAXSIZE);
lsh_IDFTable = NULL;
weights_loaded = false;
lsh_setup_signtable();
}
double lsh_compare_internal(LSHVECTOR *a,LSHVECTOR *b,double *sig)
{
double res = 0.0;
double dotproduct;
int32 intersectcount = 0;
uint32 hash1,hash2;
LSH_ITEM *aptr,*aend,*bptr,*bend;
int32 t1,t2;
double w1,w2;
uint32 numflip,diff,min,max;
aptr = a->items;
aend = aptr + a->numitems;
bptr = b->items;
bend = bptr + b->numitems;
if ((aptr != aend)&&(bptr != bend)) {
hash1 = aptr->hash;
hash2 = bptr->hash;
for(;;) {
if (hash1 == hash2) {
t1 = aptr->tf;
t2 = bptr->tf;
if (t1 < t2) { /* a has the smallest number of terms with same hash */
w1 = aptr->coeff; /* Use a weight */
res += w1 * w1;
intersectcount += t1; /* All of a terms are in the intersection, count them */
}
else {
w2 = bptr->coeff; /* Use b weight */
res += w2 * w2;
intersectcount += t2; /* All of b terms are in the intersection, count them */
}
aptr++;
bptr++;
if (aptr == aend) break;
if (bptr == bend) break;
hash1 = aptr->hash;
hash2 = bptr->hash;
}
else if (hash1 < hash2) {
aptr++;
if (aptr == aend) break;
hash1 = aptr->hash;
}
else { /* hash1 > hash2 */
bptr++;
if (bptr == bend) break;
hash2 = bptr->hash;
}
}
dotproduct = res;
res /= (a->length * b->length);
}
else
dotproduct = res;
if (a->hashcount < b->hashcount) {
min = a->hashcount; /* Smallest vector is a */
max = b->hashcount;
}
else {
min = b->hashcount;
max = a->hashcount;
}
diff = max - min; /* Subtract to get a positive difference */
numflip = min - intersectcount;
*sig = dotproduct - numflip * (lsh_probflip0_norm + lsh_probflip1_norm/max)
- diff * (lsh_probdiff0_norm + lsh_probdiff1_norm/max) + lsh_addend;
return res;
}
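Note on the comparison: lsh_compare_internal above walks two hash-sorted item lists in a single merge pass, accumulates the squared weight of each shared hash (using the side with the smaller term frequency), and then derives a cosine similarity and a significance score. The block below is a standalone sketch of that calculation under stated assumptions: the penalty constants are the fallback defaults from lsh_load_weights with scale 1.0 and addend 0.0, the caller supplies the precomputed length and hashcount as lsh_calc_weights would, at least one vector is non-empty, and the names CmpItem, CmpVector and sketch_compare are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t hash;    /* feature hash; items must be sorted ascending by hash */
    uint32_t tf;      /* term frequency */
    double coeff;     /* precomputed weight of this hash */
} CmpItem;

typedef struct {
    const CmpItem *items;
    size_t numitems;
    uint32_t hashcount;   /* sum of tf over all items */
    double length;        /* Euclidean length of the coeff vector */
} CmpVector;

/* Fallback defaults from lsh_load_weights (scale 1.0, addend 0.0 assumed). */
static const double PROBFLIP0 = 0.2, PROBFLIP1 = 20.0;
static const double PROBDIFF0 = 0.2, PROBDIFF1 = 20.0;

/* Returns the cosine similarity and stores the significance score in *sig. */
double sketch_compare(const CmpVector *a, const CmpVector *b, double *sig)
{
    size_t i = 0, j = 0;
    double dot = 0.0, sim = 0.0;
    uint32_t intersect = 0, min, max;

    assert(a != NULL && b != NULL && sig != NULL);
    while (i < a->numitems && j < b->numitems) {
        const CmpItem *x = &a->items[i];
        const CmpItem *y = &b->items[j];
        if (x->hash == y->hash) {
            /* Shared hash: the side with fewer occurrences bounds the overlap */
            const CmpItem *lesser = (x->tf < y->tf) ? x : y;
            dot += lesser->coeff * lesser->coeff;
            intersect += lesser->tf;
            i++;
            j++;
        } else if (x->hash < y->hash) {
            i++;
        } else {
            j++;
        }
    }
    if (a->numitems > 0 && b->numitems > 0)
        sim = dot / (a->length * b->length);
    min = (a->hashcount < b->hashcount) ? a->hashcount : b->hashcount;
    max = (a->hashcount < b->hashcount) ? b->hashcount : a->hashcount;
    assert(max > 0);   /* sketch assumption: not both vectors empty */
    *sig = dot
         - (min - intersect) * (PROBFLIP0 + PROBFLIP1 / max)   /* hash flips */
         - (max - min) * (PROBDIFF0 + PROBDIFF1 / max);        /* size difference */
    return sim;
}
```

As a worked case, take two vectors of four unit-weight items each (length 2.0, hashcount 4) that share two hashes. The dot product is 2, so the similarity is 2 / (2.0 * 2.0) = 0.5. Two hashes are flipped and the sizes match, so the significance is 2 - 2 * (0.2 + 20/4) = -8.4.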

View File

@ -0,0 +1,107 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION lshvector" to load this file. \quit
-- Create user-defined type for feature vector
CREATE FUNCTION lshvector_in(cstring)
RETURNS lshvector
AS 'MODULE_PATHNAME'
LANGUAGE C STABLE STRICT;
-- Stable because of configurable weights
CREATE FUNCTION lshvector_out(lshvector)
RETURNS cstring
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION lshvector_recv(internal)
RETURNS lshvector
AS 'MODULE_PATHNAME'
LANGUAGE C STABLE STRICT;
-- Stable because of configurable weights
CREATE FUNCTION lshvector_send(lshvector)
RETURNS bytea
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION lshvector_hash(lshvector)
RETURNS int8
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION lsh_load()
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;
CREATE FUNCTION lsh_reload()
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT;
CREATE FUNCTION lsh_getweight(lshvector,int4)
RETURNS float8
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;
CREATE TYPE lshvector (
INTERNALLENGTH = variable,
INPUT = lshvector_in,
OUTPUT = lshvector_out,
RECEIVE = lshvector_recv,
SEND = lshvector_send,
ALIGNMENT = double,
STORAGE = external
);
CREATE TYPE lshvector_comptype AS (
sim DOUBLE PRECISION,
sig DOUBLE PRECISION
);
CREATE FUNCTION lshvector_compare(lshvector,lshvector)
RETURNS lshvector_comptype
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION lshvector_overlap(lshvector,lshvector)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C STABLE STRICT;
CREATE FUNCTION lshvector_gin_extract_value(lshvector,internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STABLE STRICT;
CREATE FUNCTION lshvector_gin_extract_query(lshvector,internal,int2,internal,internal,internal,internal)
RETURNS internal
AS 'MODULE_PATHNAME'
LANGUAGE C STABLE STRICT;
CREATE FUNCTION lshvector_gin_consistent(internal, int2, lshvector, int4, internal, internal, internal, internal)
RETURNS bool
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE STRICT;
CREATE OPERATOR % (
LEFTARG = lshvector,
RIGHTARG = lshvector,
PROCEDURE = lshvector_overlap,
COMMUTATOR = '%',
RESTRICT = contsel,
JOIN = contjoinsel
);
CREATE OPERATOR CLASS gin_lshvector_ops
FOR TYPE lshvector USING gin
AS
OPERATOR 1 % (lshvector,lshvector),
FUNCTION 1 btint4cmp (int4,int4),
FUNCTION 2 lshvector_gin_extract_value (lshvector,internal),
FUNCTION 3 lshvector_gin_extract_query (lshvector,internal,int2,internal,internal,internal,internal),
FUNCTION 4 lshvector_gin_consistent (internal,int2,lshvector,int4,internal,internal,internal,internal),
STORAGE int4;

View File

@ -0,0 +1,6 @@
# Locality Sensitive Hashing extension
comment = 'a feature vector type and a locality sensitive hashing index'
default_version = '1.0'
module_pathname = '$libdir/lshvector'
superuser = false
relocatable = true

View File

@ -0,0 +1,175 @@
<?xml version='1.0' encoding='ISO-8859-1' ?>
<!--
This is an XML file intended to be parsed by the Ghidra help system. It is loosely based
upon the JavaHelp table of contents document format. The Ghidra help system uses a
TOC_Source.xml file to allow a module with help to define how its contents appear in the
Ghidra help viewer's table of contents. The main document (in the Base module)
defines a basic structure for the
Ghidra table of contents system. Other TOC_Source.xml files may use this structure to insert
their files directly into this structure (and optionally define a substructure).
In this document, a tag can be either a <tocdef> or a <tocref>. The former is a definition
of an XML item that may have a link and may contain other <tocdef> and <tocref> children.
<tocdef> items may be referred to in other documents by using a <tocref> tag with the
appropriate id attribute value. Using these two tags allows any module to define a place
in the table of contents system (<tocdef>), which also provides a place for
other TOC_Source.xml files to insert content (<tocref>).
During the help build time, all TOC_Source.xml files will be parsed and validated to ensure
that all <tocref> tags point to valid <tocdef> tags. From these files will be generated
<module name>_TOC.xml files, which are table of contents files written in the format
desired by the JavaHelp system. Additionally, the generated files will be merged together
as they are loaded by the JavaHelp system. In the end, when displaying help in the Ghidra
help GUI, there will be one table of contents that has been created from the definitions in
all of the modules' TOC_Source.xml files.
Tags and Attributes
<tocdef>
-id - the name of the definition (this must be unique across all TOC_Source.xml files)
-text - the display text of the node, as seen in the help GUI
-target** - the file to display when the node is clicked in the GUI
-sortgroup - this is a string that defines where a given node should appear under a given
parent. The string values will be sorted by the JavaHelp system using
a java.text.RuleBasedCollator. If this attribute is not specified, then
the text attribute will be used.
<tocref>
-id - The id of the <tocdef> that this reference points to
**The URL for the target is relative and should start with 'help/topics'. This text is
used by the Ghidra help system to provide a universal starting point for all links so that
they can be resolved at runtime, across modules.
-->
<tocroot>
<tocref id="Ghidra Functionality">
<tocdef id="BSim"
text="BSim"
target= "help/topics/BSim/BSimOverview.html">
<tocdef id="BSimDatabaseConfiguration" sortgroup="a"
text="BSim Database Configuration"
target="help/topics/BSim/DatabaseConfiguration.html" >
<tocdef id="BSim Overview"
sortgroup="a"
text="Overview"
target="help/topics/BSim/DatabaseConfiguration.html#ConfigOverview" />
<tocdef id="BSim Server Configuration"
sortgroup="b"
text="Server Configuration"
target="help/topics/BSim/DatabaseConfiguration.html#ServerConfig" />
<tocdef id="Creating a BSim Database"
sortgroup="c"
text="Creating a Database"
target="help/topics/BSim/DatabaseConfiguration.html#CreateDatabase" />
<tocdef id="Tailoring BSim Meta-dataX"
sortgroup="d"
text="Tailoring BSim Meta-data"
target="help/topics/BSim/DatabaseConfiguration.html#TailorBSim" />
</tocdef>
<tocdef id="BSimIngestProcess" sortgroup="b"
text="Ingesting Executables"
target="help/topics/BSim/IngestProcess.html" >
<tocdef id="BSim Ingest Process"
sortgroup="a"
text="Ingest Process"
target="help/topics/BSim/IngestProcess.html#IngestOverview"/>
<tocdef id="BSim Tailoring Analysis"
sortgroup="b"
text="Tailoring Analysis"
target="help/topics/BSim/IngestProcess.html#TailorAnalysis"/>
<tocdef id="BSim Analysis Effects on Feature Extraction"
sortgroup="c"
text="Analysis Effects on Feature Extraction"
target="help/topics/BSim/IngestProcess.html#AnalysisEffects"/>
<tocdef id="BSim Maintenance"
sortgroup="d"
text="Maintenance"
target="help/topics/BSim/IngestProcess.html#Maintenance"/>
<tocdef id="BSim Migration"
sortgroup="e"
text="Migration"
target="help/topics/BSim/IngestProcess.html#Migration"/>
</tocdef>
<tocdef id="BSimSearch"
text="BSim Search"
target = "help/topics/BSimSearchPlugin/BSimSearch.html">
<tocdef id="Adding_BSim_Plugin"
sortgroup="a"
text="Enabling the BSim Search Plugin"
target = "help/topics/BSimSearchPlugin/BSimSearch.html#Adding_BSim_Plugin">
</tocdef>
<tocdef id="BSim_Servers_Dialog"
sortgroup="b"
text="Defining And Managing BSim Database Definitions"
target = "help/topics/BSimSearchPlugin/BSimSearch.html#BSim_Servers_Dialog">
</tocdef>
<tocdef id="BSim_Overview_Dialog"
sortgroup="c"
text="Overview Query"
target = "help/topics/BSimSearchPlugin/BSimSearch.html#BSim_Overview_Dialog">
</tocdef>
<tocdef id="BSim_Overview_Results"
sortgroup="d"
text="Overview Query Results"
target = "help/topics/BSimSearchPlugin/BSimSearch.html#BSim_Overview_Results">
</tocdef>
<tocdef id="BSim_Search_Dialog"
sortgroup="e"
text="Similar Function Search"
target = "help/topics/BSimSearchPlugin/BSimSearch.html#BSim_Search_Dialog">
</tocdef>
<tocdef id="Similar_Functions_Results"
sortgroup="f"
text="Similar Function Search Results"
target = "help/topics/BSimSearchPlugin/BSimSearch.html#Similar_Functions_Results">
</tocdef>
<tocdef id="BSim_Authentication"
sortgroup="g"
text="Authentication"
target = "help/topics/BSimSearchPlugin/BSimSearch.html#BSim_Authentication">
</tocdef>
</tocdef>
<tocdef id="BSimFeatureWeight" sortgroup="d"
text="Features and Weights"
target="help/topics/BSim/FeatureWeight.html" >
<tocdef id="BSim Features of Software Functions"
sortgroup="a"
text="Features of Software Functions"
target="help/topics/BSim/FeatureWeight.html#FunctionFeatures"/>
<tocdef id="BSim Weighting Software Features"
sortgroup="b"
text="Weighting Software Features"
target="help/topics/BSim/FeatureWeight.html#WeightingSoftware"/>
<tocdef id="BSim Comparing Feature Vectors"
sortgroup="d"
text="Comparing Feature Vectors"
target="help/topics/BSim/FeatureWeight.html#CompareVectors"/>
</tocdef>
<tocdef id="BSimCommandLine" sortgroup="e"
text="Command-Line Utility Reference"
target="help/topics/BSim/CommandLineReference.html" >
<tocdef id="BSim Control (bsim_ctl)"
sortgroup="a"
text="BSim Control (bsim_ctl)"
target="help/topics/BSim/CommandLineReference.html#BSimCtl"/>
<tocdef id="BSim Command (bsim)"
sortgroup="b"
text="BSim Command (bsim)"
target="help/topics/BSim/CommandLineReference.html#BSimCommand"/>
</tocdef>
</tocdef>
</tocref>
</tocroot>

View File

@ -0,0 +1,25 @@
/* ###
* IP: GHIDRA
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
This file contains non-Ghidra style sheet markup. This file will be loaded in addition to
DefaultStyle.css.
*/
div.informalexample { margin-left: 50px; margin-top: 10px; }
dd { margin-bottom: 20px; }
dd p { margin-top: 5px; margin-left: 10px; }
span.term { font-family:times new roman; font-size:14pt; font-weight:bold; }
span.redtext { color:#CC0033; }

View File

@ -0,0 +1,197 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<HTML>
<HEAD>
<META name="generator" content=
"HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<TITLE>BSim Database</TITLE>
<LINK rel="stylesheet" type="text/css" href="help/shared/DefaultStyle.css">
<LINK rel="stylesheet" type="text/css" href="../../shared/languages.css">
<META name="generator" content="DocBook XSL Stylesheets V1.79.1">
<LINK rel="home" href="index.html" title="BSim Database">
<LINK rel="up" href="index.html" title="BSim Database">
<LINK rel="prev" href="index.html" title="BSim Database">
<LINK rel="next" href="DatabaseConfiguration.html" title="Database Configuration">
</HEAD>
<BODY>
<DIV class="chapter">
<DIV class="titlepage">
<DIV>
<DIV>
<H1 class="title"><A name="DatabaseOverview"></A>BSim Database</H1>
</DIV>
</DIV>
</DIV>
<DIV class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<H3 class="title">Quick Reference Links</H3>
<DIV class="itemizedlist">
<UL class="itemizedlist compact" style="list-style-type: disc;">
<LI class="listitem"><A class="link" href="DatabaseConfiguration.html" title=
"Database Configuration">Database Configuration</A></LI>
<LI class="listitem"><A class="link" href="IngestProcess.html" title=
"Ingesting Executables">Ingesting Executables</A></LI>
<LI class="listitem"><A class="link" href="../BSimSearchPlugin/BSimSearch.html" title=
"Querying a BSim Database">Querying a BSim Database</A></LI>
<LI class="listitem"><A class="link" href="FeatureWeight.html" title=
"Features and Weights">Features and Weights</A></LI>
<LI class="listitem"><A class="link" href="CommandLineReference.html" title=
"Command-Line Utility Reference">Command-Line Reference</A></LI>
</UL>
</DIV>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="IntroOverview"></A>Overview</H2>
</DIV>
</DIV>
</DIV>
<P>Welcome to Ghidra's BSim (Behavioral Similarity) Database. This database technology is
designed to allow reverse engineers to ingest metadata about previously analyzed binary
executables to a central server or local database, which can then be queried in the
course of analyzing new,
unknown, executables to quickly discover previously seen functions and libraries.</P>
<P>The primary record ingested into the database describes a single function. The most
novel aspects of the database are that:</P>
<DIV class="informalexample">
<DIV class="itemizedlist">
<UL class="itemizedlist" style="list-style-type: disc;">
<LI class="listitem">Queries are tolerant of variations in the compilation of the
function.</LI>
<LI class="listitem">All records are indexed for quick queries, even for very large
collections.</LI>
</UL>
</DIV>
</DIV>
<P>The primary feature set used for indexing a function is extracted from a concise
description of the data-flow of the function, not the explicit encoding of the machine
instructions. The data-flow description is a graph-based (abstract syntax tree)
representation, based on Ghidra's intermediate representation language, p-code, and is
generated by the Ghidra decompiler. The resulting function descriptions are normalized to
minimize the impact of variations due to:</P>
<DIV class="informalexample">
<DIV class="itemizedlist">
<UL class="itemizedlist" style="list-style-type: disc;">
<LI class="listitem">Equivalent machine instructions</LI>
<LI class="listitem">Storage location (registers, stack, memory)</LI>
<LI class="listitem">Instruction order</LI>
<LI class="listitem">Many forms of compiler transformation</LI>
<LI class="listitem">Even some forms of deliberate obfuscation.</LI>
</UL>
</DIV>
</DIV>
<P>Records are indexed using current Text Retrieval strategies, which allow "nearest
neighbor" queries. The feature set of an unknown function being queried does not have to
exactly match the features of a "hit" in the database, but only a configurable percentage
of them. This supplies an additional level of tolerance of "functional difference" on top
of the tolerance of "functionally equivalent" variations provided by the decompiler. In
other words, there can be some amount of true change in the underlying source code, and the
query may still be able to find a match.</P>
<P>Queries are quick: For a single function, results typically come back in a fraction of
a second, even for a database containing millions of functions.</P>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="ToolOverview"></A>Overview of
Tools</H2>
</DIV>
</DIV>
</DIV>
<P>A BSim Database is built on top of one of three technologies: PostgreSQL,
local H2 database, or Elasticsearch.
PostgreSQL is a robust, production capable, server that supports multiple simultaneous
connections and is extremely fault tolerant. Elasticsearch is a scalable search engine that
allows a database to be distributed across an entire cluster of machines.
The local H2 database support is provided for convenience and use with small personal
collections. For any of these options, this distribution includes specific reverse
engineering extensions and clients that provide the following capabilities.</P>
<DIV class="informalexample">
<DIV class="itemizedlist">
<UL class="itemizedlist" style="list-style-type: disc;">
<LI class="listitem">
Integration with a Ghidra Server or local project:
<DIV class="itemizedlist">
<UL class="itemizedlist" style="list-style-type: circle;">
<LI class="listitem">Ingest can be with respect to a Ghidra repository
from either a Ghidra Server or local project.</LI>
<LI class="listitem">Query results can refer to executables within a
repository.</LI>
<LI class="listitem">Easy command-line ingests using the <CODE class=
"filename">bsim</CODE> command script</LI>
</UL>
</DIV>
</LI>
<LI class="listitem">
Client as a Ghidra Plug-in:
<DIV class="itemizedlist">
<UL class="itemizedlist" style="list-style-type: circle;">
<LI class="listitem">Ghidra includes a plug-in client that integrates a query
dialog and results windows directly into the main code browser.</LI>
</UL>
</DIV>
</LI>
<LI class="listitem">
Query API:
<DIV class="itemizedlist">
<UL class="itemizedlist" style="list-style-type: circle;">
<LI class="listitem">Ghidra includes a Java API to the BSim server so that
queries (and potentially ingest) can be incorporated into analyst scripts. The
API marshals queries and results between an active Ghidra session and a BSim
server.</LI>
</UL>
</DIV>
</LI>
</UL>
</DIV>
</DIV>
<DIV class="note" style="margin-left: 0.5in; margin-right: 0.5in;">
<H3 class="title">Note</H3>
<P>The PostgreSQL server software is currently only supported for the <SPAN class=
"emphasis"><EM>Linux</EM></SPAN> and <SPAN class="emphasis"><EM>MacOS</EM></SPAN>
architectures. Elasticsearch server software must be obtained separately. Small local
file-based databases are supported on all platforms via an embedded H2 database
engine. The BSim client
software is supported on all platforms and can connect to servers on a different
architecture.</P>
</DIV>
</DIV>
</DIV>
</BODY>
</HTML>


@ -0,0 +1,820 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<HTML>
<HEAD>
<META name="generator" content=
"HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<TITLE>Command-Line Utility Reference</TITLE>
<LINK rel="stylesheet" type="text/css" href="help/shared/DefaultStyle.css">
<LINK rel="stylesheet" type="text/css" href="../../shared/languages.css">
<META name="generator" content="DocBook XSL Stylesheets V1.79.1">
<LINK rel="home" href="index.html" title="BSim Database">
<LINK rel="up" href="index.html" title="BSim Database">
<LINK rel="prev" href="FeatureWeight.html" title="Features and Weights">
</HEAD>
<BODY>
<DIV class="chapter">
<DIV class="titlepage">
<DIV>
<DIV>
<H1 class="title"><A name="CommandLineReference"></A>Command-Line Utility
Reference</H1>
</DIV>
</DIV>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="BSimCtl"></A><CODE class=
"computeroutput">bsim_ctl</CODE></H2>
</DIV>
</DIV>
</DIV>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim_ctl start &lt;/datadir-path&gt; [auth=pki|password|trust] [--noLocalAuth] [cafile=&lt;/cacert-path&gt;] [dn=".."]
bsim_ctl stop &lt;/datadir-path&gt; [--force]
bsim_ctl adduser &lt;/datadir-path&gt; &lt;username&gt; [dn=".."]
bsim_ctl dropuser &lt;/datadir-path&gt; &lt;username&gt;
bsim_ctl resetpassword &lt;username&gt;
bsim_ctl changeauth &lt;/datadir-path&gt; [auth=pki|password|trust] [--noLocalAuth] [cafile=&lt;/cacert-path&gt;] [dn=".."]
bsim_ctl changeprivilege &lt;username&gt; admin|user
Global Options:
port=&lt;portnum&gt;
user=&lt;username&gt;
cert=&lt;/certfile-path&gt;
</CODE>
</PRE>
</DIV>
<P><SPAN class="command"><STRONG>bsim_ctl</STRONG></SPAN> is a command-line utility for
starting and stopping a BSim server using the PostgreSQL back-end that is prepackaged with
the Ghidra distribution. All commands must be run on the machine hosting the server.
Optional parameters for a given command are indicated by square brackets '[' and ']'.
Options with an '=' character require a user specified value. If the value string requires
space characters, it should be enclosed in double quotes.</P>
<DIV class="informalexample">
<DIV class="variablelist">
<DL class="variablelist">
<DT><SPAN class="term"><SPAN class="bold"><STRONG>start</STRONG></SPAN></SPAN></DT>
<DD>
<P>Initializes and starts a PostgreSQL server. The command-line must include a path
to the data directory for the server, which must exist. If a server had run
previously and populated this directory, this command simply restarts the server
using the preexisting data and configuration; otherwise, a new database is
initialized. The user performing the initial start is automatically added to the
database with <SPAN class="emphasis"><EM>admin</EM></SPAN> privileges.</P>
<P>During a restart, any authentication options (with the exception of the global
<SPAN class="bold"><STRONG>cert=</STRONG></SPAN> option) are unnecessary and will
be ignored. The PostgreSQL server will be restarted with the already established
settings. To actually change the settings, use the <SPAN class=
"bold"><STRONG>changeauth</STRONG></SPAN> command before restarting.</P>
<P><SPAN class="command"><STRONG>auth=</STRONG></SPAN><SPAN class=
"emphasis"><EM>type</EM></SPAN> - specifies the authentication type (<B>pki |
password | trust</B>) for a new database: <SPAN class=
"emphasis"><EM>trust</EM></SPAN> for no authentication, <SPAN class=
"emphasis"><EM>password</EM></SPAN> for password authentication, and <SPAN class=
"emphasis"><EM>pki</EM></SPAN> for authentication using public key certificates.
With the <SPAN class="emphasis"><EM>pki</EM></SPAN> setting, both the <SPAN class=
"bold"><STRONG>cafile=</STRONG></SPAN> and the <SPAN class=
"bold"><STRONG>dn=</STRONG></SPAN> options also need to be provided; additionally
the <SPAN class="bold"><STRONG>cert=</STRONG></SPAN> option must be provided unless
the <SPAN class="bold"><STRONG>--noLocalAuth</STRONG></SPAN> option is also
given.</P>
<P><SPAN class="command"><STRONG>--noLocalAuth</STRONG></SPAN> - used together with
the <SPAN class="command"><STRONG>auth=</STRONG></SPAN> option causes
authentication to not be required for local connections, i.e. localhost.</P>
<P><SPAN class="command"><STRONG>cafile=</STRONG></SPAN><SPAN class=
"emphasis"><EM>/cafile-path</EM></SPAN> - specifies an absolute path to a
certificate authority file and is required for <SPAN class=
"command"><STRONG>auth=pki</STRONG></SPAN>. This file should contain the
certificates that the PostgreSQL server will use for authentication, concatenated
together in PEM format.</P>
<P><SPAN class="command"><STRONG>dn=</STRONG></SPAN><SPAN class=
"emphasis"><EM>name</EM></SPAN> - specifies the Distinguished Name for the admin
user and is required for <SPAN class=
"command"><STRONG>auth=pki</STRONG></SPAN>.</P>
<P><SPAN class="command"><STRONG>port=</STRONG></SPAN><SPAN class=
"emphasis"><EM>portnum</EM></SPAN> - specifies the port the PostgreSQL server will
listen on. For port numbers other than the default 5432, URLs and other
command-lines must explicitly specify the port when connecting to the server. This
option only affects the initial start of a server. For subsequent (re)starts this
option is ignored, and the server will continue to listen on the same port
specified in the initial start. Use <SPAN class=
"command"><STRONG>changeauth</STRONG></SPAN> to change the port of a server after
its initial start.</P>
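<P>As a hedged illustration, an initial start might look like the following. The data
directory <CODE class="filename">/data/bsim_data</CODE>, the CA file path, and the
Distinguished Name are hypothetical values; only options listed in the synopsis above are
used. The first line uses password authentication. The second uses PKI authentication with
<SPAN class="command"><STRONG>--noLocalAuth</STRONG></SPAN>, so the <SPAN class=
"command"><STRONG>cert=</STRONG></SPAN> option is omitted.</P>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim_ctl start /data/bsim_data auth=password
bsim_ctl start /data/bsim_data auth=pki --noLocalAuth cafile=/etc/bsim/cacerts.pem dn="CN=BSim Admin"
</CODE>
</PRE>
</DIV>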
</DD>
<DT><SPAN class="term"><SPAN class="bold"><STRONG>stop</STRONG></SPAN></SPAN></DT>
<DD>
<P>Stops a currently running PostgreSQL server. The path to the actively used data
directory must be provided. By default, shutdown will wait until existing
connections to the database have been closed.</P>
<P><SPAN class="command"><STRONG>--force</STRONG></SPAN> - causes existing
connections to be forcibly closed and the PostgreSQL server to shut down
immediately.</P>
</DD>
<DT><SPAN class="term"><SPAN class="bold"><STRONG>adduser</STRONG></SPAN></SPAN></DT>
<DD>
<P>Give a new user permission to access the PostgreSQL server. The path to the
actively used data directory and a single username must be specified. The server
must be running. New users are given <SPAN class="emphasis"><EM>user</EM></SPAN>
(read-only) privileges, unless a subsequent <SPAN class=
"command"><STRONG>changeprivilege</STRONG></SPAN> command is used.</P>
<P><SPAN class="command"><STRONG>dn=</STRONG></SPAN><SPAN class=
"emphasis"><EM>name</EM></SPAN> - specifies the Distinguished Name of the new user,
which is required if the database enabled <SPAN class=
"command"><STRONG>auth=pki</STRONG></SPAN>. This option can be used to provide a
Distinguished Name to a preexisting user, if the PostgreSQL server's authentication
strategy is changed.</P>
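<P>A minimal sketch of adding a user and then promoting that user, assuming a running
server whose data directory is the hypothetical path <CODE class=
"filename">/data/bsim_data</CODE> and a hypothetical username <CODE class=
"computeroutput">alice</CODE>:</P>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim_ctl adduser /data/bsim_data alice
bsim_ctl changeprivilege alice admin
</CODE>
</PRE>
</DIV>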
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>dropuser</STRONG></SPAN></SPAN></DT>
<DD>
<P>Remove access to the PostgreSQL server for a specific user. The path to the
actively used data directory and a single username must be specified. The server
must be running.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>changeauth</STRONG></SPAN></SPAN></DT>
<DD>
<P>Change the configuration of a previously initialized PostgreSQL server. The path
to the server's data directory must be specified. The server must not currently be
running to use this command, which only takes effect after a restart. Options have
the same meaning as for the <SPAN class="command"><STRONG>start</STRONG></SPAN>
command.</P>
<P><SPAN class="command"><STRONG>port=</STRONG></SPAN><SPAN class=
"emphasis"><EM>portnum</EM></SPAN> - changes the port the PostgreSQL server will
listen on. If this option is not present, the server will continue to listen on the
same port.</P>
<P><SPAN class="command"><STRONG>auth=</STRONG></SPAN><SPAN class=
"emphasis"><EM>type</EM></SPAN> - changes the authentication type (<B>pki |
password | trust</B>) used by the PostgreSQL server. No change is made if the
option is not present. If the option is present, omitting the <SPAN class=
"command"><STRONG>--noLocalAuth</STRONG></SPAN> causes local connections to require
authentication. This command does not affect the presence or absence of passwords
or Distinguished Names for existing users.</P>
<P><SPAN class="command"><STRONG>dn=</STRONG></SPAN><SPAN class=
"emphasis"><EM>name</EM></SPAN> - specifies the Distinguished Name for the admin
user and is required for <SPAN class=
"command"><STRONG>auth=pki</STRONG></SPAN>.</P>
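<P>Because the server must be stopped for this command and the change only takes effect
after a restart, a reconfiguration is typically a three-step sequence. The sketch below
assumes a hypothetical data directory <CODE class="filename">/data/bsim_data</CODE> and
an arbitrary new port number:</P>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim_ctl stop /data/bsim_data
bsim_ctl changeauth /data/bsim_data auth=password port=5433
bsim_ctl start /data/bsim_data
</CODE>
</PRE>
</DIV>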
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>resetpassword</STRONG></SPAN></SPAN></DT>
<DD>
<P>Reset the password for a user. A single user must be specified, and the
PostgreSQL server must be running. The password will be reset to 'changeme'.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>changeprivilege</STRONG></SPAN></SPAN></DT>
<DD>
<P>Change access privilege for a user. A single user must be specified followed by
<SPAN class="command"><STRONG>admin</STRONG></SPAN> or <SPAN class=
"command"><STRONG>user</STRONG></SPAN>, and the PostgreSQL server must be
running.</P>
</DD>
<DT><SPAN class="term"><SPAN class="bold"><STRONG>--Global
Options--</STRONG></SPAN></SPAN></DT>
<DD>
<P>These options apply to all the <SPAN class=
"command"><STRONG>bsim_ctl</STRONG></SPAN> commands that connect to an active
PostgreSQL server: <SPAN class="command"><STRONG>start</STRONG></SPAN>, <SPAN
class="command"><STRONG>adduser</STRONG></SPAN>, <SPAN class=
"command"><STRONG>dropuser</STRONG></SPAN>, <SPAN class=
"command"><STRONG>resetpassword</STRONG></SPAN>, and <SPAN class=
"command"><STRONG>changeprivilege</STRONG></SPAN>.</P>
<P><SPAN class="command"><STRONG>port=</STRONG></SPAN><SPAN class=
"emphasis"><EM>portnum</EM></SPAN> - specifies the port on which to connect with
the PostgreSQL server.</P>
<P><SPAN class="command"><STRONG>user=</STRONG></SPAN><SPAN class=
"emphasis"><EM>username</EM></SPAN> - specifies a user name to use when connecting
to the PostgreSQL server.</P>
<P><SPAN class="command"><STRONG>cert=</STRONG></SPAN><SPAN class=
"emphasis"><EM>/certfile-path</EM></SPAN> - provides the absolute file path to the
user's certificate when connecting to a PostgreSQL server that requires PKI
authentication.</P>
</DD>
</DL>
</DIV>
</DIV>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="BSimCommand"></A><CODE class=
"computeroutput">bsim</CODE></H2>
</DIV>
</DIV>
</DIV>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim createdatabase &lt;bsimURL&gt; &lt;config_template&gt; [name="&lt;name&gt;"] [owner="&lt;owner&gt;"] [description="&lt;text&gt;"] [--nocallgraph]
bsim setmetadata &lt;bsimURL&gt; [name="&lt;name&gt;"] [owner="&lt;owner&gt;"] [description="&lt;text&gt;"]
bsim addexecategory &lt;bsimURL&gt; &lt;category_name&gt; [--date]
bsim addfunctiontag &lt;bsimURL&gt; &lt;tag_name&gt;
bsim dropindex &lt;bsimURL&gt;
bsim rebuildindex &lt;bsimURL&gt;
bsim prewarm &lt;bsimURL&gt;
bsim generatesigs &lt;ghidraURL&gt; &lt;/xmldirectory&gt; config=&lt;config_template&gt; [--overwrite]
bsim generatesigs &lt;ghidraURL&gt; &lt;/xmldirectory&gt; bsim=&lt;bsimURL&gt; [--commit] [--overwrite]
bsim generatesigs &lt;ghidraURL&gt; bsim=&lt;bsimURL&gt;
bsim commitsigs &lt;bsimURL&gt; &lt;/xmldirectory&gt; [md5=&lt;hash&gt;] [override=&lt;ghidraURL&gt;]
bsim generateupdates &lt;ghidraURL&gt; &lt;/xmldirectory&gt; config=&lt;config_template&gt; [--overwrite]
bsim generateupdates &lt;ghidraURL&gt; &lt;/xmldirectory&gt; bsim=&lt;bsimURL&gt; [--commit] [--overwrite]
bsim generateupdates &lt;ghidraURL&gt; bsim=&lt;bsimURL&gt;
bsim commitupdates &lt;bsimURL&gt; &lt;/xmldirectory&gt;
bsim listexes &lt;bsimURL&gt; [md5=&lt;hash&gt;] [name=&lt;exe_name&gt;] [arch=&lt;languageID&gt;] [compiler=&lt;cspecID&gt;] [sortcol=&lt;column_name&gt;] [limit=&lt;exe_count&gt;] [--includelibs]
bsim getexecount &lt;bsimURL&gt; [md5=&lt;hash&gt;] [name=&lt;exe_name&gt;] [arch=&lt;languageID&gt;] [compiler=&lt;cspecID&gt;] [--includelibs]
bsim delete &lt;bsimURL&gt; [md5=&lt;hash&gt;] [name=&lt;exe_name&gt; [arch=&lt;languageID&gt;] [compiler=&lt;cspecID&gt;]]
bsim listfuncs &lt;bsimURL&gt; [md5=&lt;hash&gt;] [name=&lt;exe_name&gt; [arch=&lt;languageID&gt;] [compiler=&lt;cspecID&gt;]] [--printselfsig] [--callgraph] [--printjustexe] [maxfunc=&lt;max_count&gt;]
bsim dumpsigs &lt;bsimURL&gt; &lt;/xmldirectory&gt; [md5=&lt;hash&gt;] [name=&lt;exe_name&gt; [arch=&lt;languageID&gt;] [compiler=&lt;cspecID&gt;]]
Global options:
user=&lt;username&gt;
cert=&lt;certfile-path&gt;
</CODE>
</PRE>
</DIV>
<P>See <A class="xref" href="CommandLineReference.html#URLs">&ldquo;Ghidra and BSim
URLs&rdquo;</A> below for details about specifying <EM>ghidraURL</EM> and <EM>bsimURL</EM>
properly. See <A class="xref" href="DatabaseConfiguration.html">&ldquo;Database
Configuration&rdquo;</A> for guidance on the various BSim Databases which are
supported.</P>
<P><SPAN class="command"><STRONG>bsim</STRONG></SPAN> is a command-line utility for
managing the generation and ingest of BSim signatures and metadata. Depending on the
subcommand, it connects to a Ghidra Server and/or a BSim database server. A <SPAN class=
"emphasis"><EM>ghidraURL</EM></SPAN> refers to a Ghidra Server or local project using the
<SPAN class="command"><STRONG>ghidra:</STRONG></SPAN> protocol, while <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> refers to a BSim database server with the appropriate
<SPAN class="command"><STRONG>postgresql:</STRONG></SPAN>, <SPAN class=
"command"><STRONG>https:</STRONG></SPAN>, or <SPAN class=
"command"><STRONG>file:</STRONG></SPAN> protocol specified. The <SPAN class=
"command"><STRONG>elastic:</STRONG></SPAN> protocol is equivalent to and may be used in
place of the <SPAN class="command"><STRONG>https:</STRONG></SPAN> protocol.</P>
<DIV class="informalexample">
<DIV class="variablelist">
<DL class="variablelist">
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>createdatabase</STRONG></SPAN></SPAN></DT>
<DD>
<P>Creates a new empty repository. A URL and configuration template (<SPAN class=
"bold"><STRONG>config_template</STRONG></SPAN>) are required. The new database name
is taken from the path element of the URL.</P>
<P>Supported configuration templates (<SPAN class=
"bold"><STRONG>config_template</STRONG></SPAN>) are defined within the Ghidra
installation in XML form. The following configurations are currently defined:
(<SPAN class="bold"><STRONG>large_32, medium_32, medium_64, medium_cpool,
medium_nosize</STRONG></SPAN>).</P>
<P><SPAN class="command"><STRONG>name=</STRONG></SPAN> - specifies a formal, more
descriptive, name for the repository that can be used for the BSim client
display.</P>
<P><SPAN class="command"><STRONG>owner=</STRONG></SPAN> - gives a descriptive name
for the owner of the repository and/or the data it will contain.</P>
<P><SPAN class="command"><STRONG>description=</STRONG></SPAN> - specifies a short
string describing the intended contents of the new repository.</P>
<P><SPAN class="command"><STRONG>--nocallgraph</STRONG></SPAN> - disables storing call
relationships between ingested functions. The default is to store call
relationships.</P>
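<P>For illustration only, creating a repository on a local PostgreSQL-backed BSim server
might look like the following. The URL form, the database name <CODE class=
"computeroutput">example_db</CODE>, and the metadata strings are assumptions; see
<A class="xref" href="CommandLineReference.html#URLs">&ldquo;Ghidra and BSim
URLs&rdquo;</A> for the authoritative URL syntax.</P>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim createdatabase postgresql://localhost/example_db medium_64 name="Example Library Corpus" owner="Analysis Team" description="Signatures for known libraries"
</CODE>
</PRE>
</DIV>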
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>setmetadata</STRONG></SPAN></SPAN></DT>
<DD>
<P>Change the global <SPAN class="emphasis"><EM>name</EM></SPAN>, <SPAN class=
"emphasis"><EM>owner</EM></SPAN>, or <SPAN class=
"emphasis"><EM>description</EM></SPAN> metadata associated with a BSim server. A
BSim server URL is required.</P>
<P><SPAN class="command"><STRONG>name=</STRONG></SPAN> - specifies a formal, more
descriptive, name for the repository that can be used for the BSim client
display.</P>
<P><SPAN class="command"><STRONG>owner=</STRONG></SPAN> - gives a descriptive name
for the owner of the repository and/or the data it will contain.</P>
<P><SPAN class="command"><STRONG>description=</STRONG></SPAN> - specifies a short
string describing the intended contents of the new repository.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>addexecategory</STRONG></SPAN></SPAN></DT>
<DD>
<P>Specify a new executable category to be included with generated metadata. A BSim
server URL and the name of the new category are required. This only affects future
ingest commands. Executables that have already been ingested are unaffected,
although they can be adjusted with a <SPAN class=
"command"><STRONG>generateupdates</STRONG></SPAN> command.</P>
<P><SPAN class="command"><STRONG>--date</STRONG></SPAN> - indicates the new category
holds date/time information.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>addfunctiontag</STRONG></SPAN></SPAN></DT>
<DD>
<P>Specify a new function tag to be included with generated metadata. A BSim server
URL and the name of the new tag are required. This only affects future ingest
commands. Functions that have already been ingested are unaffected, although they
can be adjusted with a <SPAN class="command"><STRONG>generateupdates</STRONG></SPAN>
command.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>dropindex</STRONG></SPAN></SPAN></DT>
<DD>
<P>Delete the main signature index from a BSim repository (in preparation for new
ingest). A BSim repository URL is required. Normal queries will not complete or
will be extremely slow.</P>
<P><STRONG>NOTE:</STRONG> Not supported by H2 file database</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>rebuildindex</STRONG></SPAN></SPAN></DT>
<DD>
<P>Recreate the main signature index (that had previously been dropped) for a BSim
repository. A BSim server URL is required. After this command completes, normal
function queries should be fast.</P>
<P><STRONG>NOTE:</STRONG> Not supported by H2 file database</P>
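<P>A hedged sketch of how <SPAN class="command"><STRONG>dropindex</STRONG></SPAN> and
<SPAN class="command"><STRONG>rebuildindex</STRONG></SPAN> might bracket a large ingest.
The BSim URL and the XML directory <CODE class="filename">/tmp/bsim_sigs</CODE> are
hypothetical, and a PostgreSQL back-end is assumed because these commands are not
supported by the H2 file database.</P>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim dropindex postgresql://localhost/example_db
bsim commitsigs postgresql://localhost/example_db /tmp/bsim_sigs
bsim rebuildindex postgresql://localhost/example_db
</CODE>
</PRE>
</DIV>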
</DD>
<DT><SPAN class="term"><SPAN class="bold"><STRONG>prewarm</STRONG></SPAN></SPAN></DT>
<DD>
<P>Instruct a restarted BSim server to preload pages from the main signature index
and function table into RAM. This avoids slow random access disk reads on initial
queries. A BSim server URL is required.</P>
<P><STRONG>NOTE:</STRONG> Not supported by Elasticsearch or H2 file databases</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>generatesigs</STRONG></SPAN></SPAN></DT>
<DD>
<P>Generates function signatures and metadata for all program files retrieved from
a Ghidra Server repository or project as specified by a Ghidra URL. The generated
signatures may be retained as XML "sigs_" files within a specified XML storage
directory and/or committed to the BSim database specified with the <SPAN
class="command"><STRONG>bsim=</STRONG></SPAN><SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> option. If an XML storage directory is not
specified, a BSim URL must be specified to which the data will be committed.</P>
<P>The <SPAN class="command"><STRONG>config=</STRONG></SPAN><SPAN class=
"emphasis"><EM>config-template</EM></SPAN> option may be specified when generating
XML "sigs_" signature files in the absence of a BSim database (see
<STRONG>createdatabase</STRONG> for supported configurations). The generated files
will be written to the specified XML storage directory. Creation of the signature
files can also be achieved by specifying the <STRONG>bsim=</STRONG><EM>bsimURL</EM>
option instead of the <STRONG>config=</STRONG> option.</P>
<P>The <SPAN class="command"><STRONG>--overwrite</STRONG></SPAN> <SPAN class=
"emphasis">option may be specified when an XML storage directory has also been
specified to allow conflicting signature files to be overwritten.</SPAN></P>
<P>The <SPAN class="command"><STRONG>--commit</STRONG></SPAN> <SPAN class=
"emphasis">option may be specified when a BSim URL has also been specified to allow
generated signatures to be committed to the BSim database. This option is implied
when a BSim URL has been specified without an XML storage directory.</SPAN></P>
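<P>The two usage patterns described above can be sketched as follows. The Ghidra URL, BSim
URL, and XML directory are hypothetical values, and the URL forms are assumptions; see
<A class="xref" href="CommandLineReference.html#URLs">&ldquo;Ghidra and BSim
URLs&rdquo;</A> for the authoritative syntax. The first command only writes "sigs_" XML
files, because <SPAN class="command"><STRONG>--commit</STRONG></SPAN> is not given. The
second commits directly, because no XML storage directory is specified.</P>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim generatesigs ghidra://localhost/ExampleRepo /tmp/bsim_sigs bsim=postgresql://localhost/example_db
bsim generatesigs ghidra://localhost/ExampleRepo bsim=postgresql://localhost/example_db
</CODE>
</PRE>
</DIV>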
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>commitsigs</STRONG></SPAN></SPAN></DT>
<DD>
<P>Commit previously generated signatures and metadata (see
<STRONG>generatesigs</STRONG>) to a BSim repository. A URL specifying the BSim
repository and a path to a directory containing the "sigs_" XML files to commit are
required.</P>
<P><SPAN class="command"><STRONG>override=</STRONG></SPAN><SPAN class=
"emphasis"><EM>ghidraURL</EM></SPAN> - causes any Ghidra repository/project URL,
describing the storage repository and path of executables that was recorded in the
"sigs_" XML files during signature generation, to be overridden during the commit
operation with the specified Ghidra URL.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>generateupdates</STRONG></SPAN></SPAN></DT>
<DD>
<P>Generates updated function metadata for program files from a Ghidra Server
repository or project, as specified by a Ghidra URL, which previously had signature
and metadata generated (see <STRONG>generatesigs</STRONG>). Only metadata: names,
function tags, categories, etc. are changed. Signatures are not affected. The
generated updates may be retained as XML "update_" files within a specified XML
storage directory and/or committed to the BSim database specified with the
<SPAN class="command"><STRONG>bsim=</STRONG></SPAN><SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> option. If an XML storage directory is not
specified, a BSim URL must be specified to which the data will be committed.</P>
<P>The <SPAN class="command"><STRONG>config=</STRONG></SPAN><SPAN class=
"emphasis"><EM>config-template</EM></SPAN> option may be specified when generating
XML "update_" files in the absence of a BSim database (see
<STRONG>createdatabase</STRONG> for supported configurations). The generated files
will be written to the specified XML storage directory. Creation of the update
files can also be achieved by specifying the <STRONG>bsim=</STRONG><EM>bsimURL</EM>
option instead of the <STRONG>config=</STRONG> option.</P>
<P>The <SPAN class="command"><STRONG>--overwrite</STRONG></SPAN> <SPAN class=
"emphasis">option may be specified when an XML storage directory has also been
specified to allow conflicting update files to be overwritten.</SPAN></P>
<P>The <SPAN class="command"><STRONG>--commit</STRONG></SPAN> <SPAN class=
"emphasis">option may be specified when a BSim URL has also been specified to allow
generated updates to be committed to the BSim database. This option is implied when
a BSim URL has been specified without an XML storage directory.</SPAN></P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>commitupdates</STRONG></SPAN></SPAN></DT>
<DD>
<P>Update a BSim repository with previously generated update metadata (see
<STRONG>generateupdates</STRONG>). A URL specifying the BSim repository and a path
to a directory containing the "update_" XML files to commit are required.</P>
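<P>As a sketch, a two-step metadata update could first write "update_" XML files and
then commit them. The Ghidra URL, BSim URL, and the directory <CODE class=
"filename">/tmp/bsim_updates</CODE> are hypothetical values chosen for illustration.</P>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim generateupdates ghidra://localhost/ExampleRepo /tmp/bsim_updates bsim=postgresql://localhost/example_db
bsim commitupdates postgresql://localhost/example_db /tmp/bsim_updates
</CODE>
</PRE>
</DIV>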
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>listexes</STRONG></SPAN></SPAN></DT>
<DD>
<P>List all executable program records within a specified BSim database repository
which satisfy the specified criteria. A BSim URL specifying the repository must be
provided, and one of two options, <SPAN class=
"command"><STRONG>md5=</STRONG></SPAN> or <SPAN class=
"command"><STRONG>name=</STRONG></SPAN>, that indicate the specific executable must
also be given. All matching executable records will be listed.</P>
<P><SPAN class="command"><STRONG>md5=</STRONG></SPAN><SPAN class=
"emphasis"><EM>32-hexdigits</EM></SPAN> - specifies an executable via its MD5
checksum.</P>
<P><SPAN class="command"><STRONG>name=</STRONG></SPAN> - specifies an executable
name which may match one or more executable records.</P>
<P><SPAN class="command"><STRONG>arch=</STRONG></SPAN> - specifies an architecture
as a Ghidra processor id which will be used to filter executables.</P>
<P><SPAN class="command"><STRONG>compiler=</STRONG></SPAN> - specifies a compiler
specification id which will be used to filter executables.</P>
<P><SPAN class="command"><STRONG>sortcol=</STRONG></SPAN><SPAN class=
"emphasis"><EM>column</EM></SPAN> - Indicates which display column should be used
to sort the results (<STRONG>MD5 | NAME</STRONG>; default:
<STRONG>MD5</STRONG>).</P>
<P><SPAN class="command"><STRONG>limit=</STRONG></SPAN><SPAN class=
"emphasis"><EM>max_count</EM></SPAN> - specifies the maximum number of executables
to be listed which match the search criteria (default=20, a value of 0 indicates no
limit).</P>
<P><SPAN class="command"><STRONG>--includelibs</STRONG> - If specified, executable
records which correspond to a referenced Library will be included. Each such record
has a fabricated MD5 which is based on its name.</SPAN></P>
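<P>For example, the following hedged sketch lists every record named <CODE class=
"computeroutput">libexample.so</CODE> for one architecture, sorted by name and with no
limit on the number of rows. The BSim URL and executable name are hypothetical;
<CODE class="computeroutput">x86:LE:64:default</CODE> is used as a sample Ghidra
language id.</P>
<DIV class="informalexample">
<PRE>
<CODE class="computeroutput">
bsim listexes postgresql://localhost/example_db name=libexample.so arch=x86:LE:64:default sortcol=NAME limit=0
</CODE>
</PRE>
</DIV>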
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>getexecount</STRONG></SPAN></SPAN></DT>
<DD>
<P>Get the total number of executable program records within a specified BSim
database repository which satisfy the specified criteria. A BSim URL specifying the
repository must be provided, and one of two options, <SPAN class=
"command"><STRONG>md5=</STRONG></SPAN> or <SPAN class=
"command"><STRONG>name=</STRONG></SPAN>, that indicate the specific executable must
also be given. All matching executable records will be counted.</P>
<P><SPAN class="command"><STRONG>md5=</STRONG></SPAN><SPAN class=
"emphasis"><EM>32-hexdigits</EM></SPAN> - specifies an executable via its MD5
checksum.</P>
<P><SPAN class="command"><STRONG>name=</STRONG></SPAN> - specifies an executable
name which may match one or more executable records.</P>
<P><SPAN class="command"><STRONG>arch=</STRONG></SPAN> - specifies an architecture
as a Ghidra processor id which will be used to filter executables.</P>
<P><SPAN class="command"><STRONG>compiler=</STRONG></SPAN> - specifies a compiler
specification id which will be used to filter executables.</P>
<P><SPAN class="command"><STRONG>--includelibs</STRONG></SPAN> - If specified, executable
records which correspond to a referenced Library will be included. Such records
have a fabricated MD5 which is based on the library name.</P>
</DD>
<DT><SPAN class="term"><SPAN class="bold"><STRONG>delete</STRONG></SPAN></SPAN></DT>
<DD>
<P>Remove all records associated with a specific executable from a BSim repository.
A BSim URL specifying the repository must be provided, and one of two options,
<SPAN class="command"><STRONG>md5=</STRONG></SPAN> or <SPAN class=
"command"><STRONG>name=</STRONG></SPAN>, that indicate the specific executable must
also be given. All associated executable and function records are removed.
If an executable cannot be uniquely identified an error will result.
</P>
<P><SPAN class="command"><STRONG>md5=</STRONG></SPAN><SPAN class=
"emphasis"><EM>32-hexdigits</EM></SPAN> - specifies the executable via its MD5
checksum.</P>
<P><SPAN class="command"><STRONG>name=</STRONG></SPAN> - specifies an executable
name which may match one or more executable records.</P>
<P><SPAN class="command"><STRONG>arch=</STRONG></SPAN> - specifies an architecture
as a Ghidra processor id, when the <SPAN class=
"command"><STRONG>name</STRONG></SPAN> option is not enough to uniquely specify the
executable.</P>
<P><SPAN class="command"><STRONG>compiler=</STRONG></SPAN> - specifies a compiler
id string, when the <SPAN class="command"><STRONG>name</STRONG></SPAN> option is
not enough to uniquely specify the executable.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>listfuncs</STRONG></SPAN></SPAN></DT>
<DD>
<P>List all function records associated with a specific executable from a BSim
repository. A BSim URL specifying the repository must be provided, and one of two
options, <SPAN class="command"><STRONG>md5=</STRONG></SPAN> or <SPAN class=
"command"><STRONG>name=</STRONG></SPAN>, that indicate the specific executable must
also be given. All associated executable and function records are listed. If an
executable cannot be uniquely identified an error will result.</P>
<P><SPAN class="command"><STRONG>md5=</STRONG></SPAN><SPAN class=
"emphasis"><EM>32-hexdigits</EM></SPAN> - specifies the executable via its MD5
checksum.</P>
<P><SPAN class="command"><STRONG>name=</STRONG></SPAN> - specifies an executable
name which may match one or more executable records.</P>
<P><SPAN class="command"><STRONG>arch=</STRONG></SPAN> - specifies an architecture
as a Ghidra processor id, when the <SPAN class=
"command"><STRONG>name</STRONG></SPAN> option is not enough to uniquely specify the
executable.</P>
<P><SPAN class="command"><STRONG>compiler=</STRONG></SPAN> - specifies a compiler
id string, when the <SPAN class="command"><STRONG>name</STRONG></SPAN> option is
not enough to uniquely specify the executable.</P>
<P><SPAN class="command"><STRONG>--printselfsig</STRONG></SPAN> - If specified, each
function listed will be prefixed by a calculated self-significance score. This value is
expressed as a decimal value.</P>
<P><SPAN class="command"><STRONG>--callgraph</STRONG></SPAN> - If specified, a list
of all library functions called by the identified executable will be listed after
the function list.</P>
<P><SPAN class="command"><STRONG>--printjustexe</STRONG></SPAN> - If specified, only a
summary of the executable will be displayed. If <STRONG>--callgraph</STRONG> was
also specified, only the called libraries will be listed and not the specified
functions.</P>
<P><SPAN class="command"><STRONG>maxfunc=</STRONG></SPAN><SPAN class=
"emphasis"><EM>max_count</EM></SPAN> - specifies the maximum number of functions to
be listed which correspond to the identified executable (default=1000, a value of 0
indicates no limit).</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>dumpsigs</STRONG></SPAN></SPAN></DT>
<DD>
<P>Dump signature and metadata from a BSim repository for a specific executable to
a "sigs_" XML file. A BSim server URL and a path to a directory where the new file
will be stored must be given. One of two options, <SPAN class=
"command"><STRONG>md5=</STRONG></SPAN> or <SPAN class=
"command"><STRONG>name=</STRONG></SPAN>, that specify the particular executable
must also be given. If an executable cannot be uniquely identified an error will result.</P>
<P><SPAN class="command"><STRONG>md5=</STRONG></SPAN><SPAN class=
"emphasis"><EM>32-hexdigits</EM></SPAN> - specifies an executable via its MD5
checksum.</P>
<P><SPAN class="command"><STRONG>name=</STRONG></SPAN> - specifies an executable
name which may match one or more executable records.</P>
<P><SPAN class="command"><STRONG>arch=</STRONG></SPAN> - specifies an architecture
as a Ghidra processor id, when the <SPAN class=
"command"><STRONG>name</STRONG></SPAN> option is not enough to uniquely specify the
executable.</P>
<P><SPAN class="command"><STRONG>compiler=</STRONG></SPAN> - specifies a compiler
specification id, when the <SPAN class=
"command"><STRONG>name</STRONG></SPAN> option is not enough to uniquely specify the
executable.</P>
</DD>
<DT><SPAN class="term"><SPAN class="bold"><STRONG>--Global
Options--</STRONG></SPAN></SPAN></DT>
<DD>
<P>These options apply to all <SPAN class="command"><STRONG>bsim</STRONG></SPAN>
commands.</P>
<P><SPAN class="command"><STRONG>user=</STRONG></SPAN><SPAN class=
"emphasis"><EM>name</EM></SPAN> - specifies a user to masquerade as when connecting
to the server.</P>
<P><SPAN class="command"><STRONG>cert=</STRONG></SPAN><SPAN class=
"emphasis"><EM>path</EM></SPAN> - provides a path to the user's certificate when
connecting to a server that requires PKI authentication.</P>
</DD>
</DL>
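<P>The executable-selection options shared by <SPAN class=
"command"><STRONG>delete</STRONG></SPAN>, <SPAN class=
"command"><STRONG>listfuncs</STRONG></SPAN> and <SPAN class=
"command"><STRONG>dumpsigs</STRONG></SPAN> can be sketched as a small helper. This is only
an illustration and not part of the BSim tooling: the function name is hypothetical, and
treating the two identifying options as mutually exclusive is an assumption.</P>

```python
import re

# md5= is documented as taking 32 hexadecimal digits.
MD5_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def executable_selector(md5=None, name=None, arch=None, compiler=None):
    """Return the key=value options that identify one executable.

    Hypothetical helper: it only assembles option strings and does not
    contact a BSim database.
    """
    # Assumption: exactly one of md5= or name= is supplied.
    if (md5 is None) == (name is None):
        raise ValueError("give exactly one of md5= or name=")
    options = []
    if md5 is not None:
        if MD5_PATTERN.fullmatch(md5) is None:
            raise ValueError("md5 must be 32 hexadecimal digits")
        options.append("md5=" + md5)
    else:
        options.append("name=" + name)
    # arch= and compiler= narrow a name that matches several records.
    if arch is not None:
        options.append("arch=" + arch)
    if compiler is not None:
        options.append("compiler=" + compiler)
    return options
```

<P>The returned strings would be appended to a <SPAN class=
"command"><STRONG>bsim</STRONG></SPAN> command after the BSim URL; checking the MD5 format
up front avoids a round trip to the server for an obviously malformed value.</P>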
</DIV>
</DIV>
</DIV>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="URLs"></A>Ghidra and BSim URLs</H2>
</DIV>
</DIV>
</DIV>
<P>Ghidra utilizes Universal Resource Locators (URLs) to identify both <EM>Ghidra
Server/Project Repositories</EM> and <EM>BSim Databases</EM>. See the corresponding sections
below for specific formatting details. It is important to note that local <EM>ghidra</EM> and
<EM>file</EM> URLs never include a double-slash after the protocol (i.e., "://").</P>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title" style="clear: both"><A name="GhidraURLs"></A>Ghidra Server/Project
Repository URLs</H3>
</DIV>
</DIV>
</DIV>
<P>BSim command-line tools, as well as the Ghidra GUI, utilize a URL to specify the
location of a remote Ghidra Server repository or a local Ghidra Project. Both cases work in
a very similar fashion other than the format of the URL and potential limitations of a
local Project URL. Use of a Ghidra Server repository and corresponding URLs is preferred
since any Ghidra URL metadata added to a shared BSim database has the ability to be
accessed by other users, while a local Ghidra Project URL is very limited in its visibility
and path validity on other systems. For this reason, use of a local Ghidra Project URL
should be restricted to use with a local H2 BSim Database file.</P>
<P>The format of a remote <EM>Ghidra Server URL</EM> is distinctly different from a
<EM>Local Ghidra Project URL</EM>. These URLs have the following formats:</P>
<P><STRONG>Remote Ghidra Server Repository</STRONG><BR>
</P>
<DIV class="informalexample">
<TABLE border="0" class="simplelist">
<TR>
<TD><CODE class=
"computeroutput">ghidra://&lt;hostname&gt;[:&lt;port&gt;]/&lt;repository_name&gt;[/&lt;folder_path&gt;]</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>If the default Ghidra Server port (1111) is in use it need not be specified in the URL.
The <EM>hostname</EM> may specify either a Fully Qualified Domain Name (FQDN, e.g.,
<EM>host.abc.com</EM>) or an IPv4 address (e.g., <EM>1.2.3.4</EM>).</P>
<STRONG>Local Ghidra Project</STRONG><BR>
<DIV class="informalexample">
<TABLE border="0" class="simplelist">
<TR>
<TD><CODE class=
"computeroutput">ghidra:[/&lt;directory_path&gt;]/&lt;project_name&gt;[?/&lt;folder_path&gt;]</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>For local project URLs, the absolute directory path containing the project
<EM>*.gpr</EM> locator file must be specified with the project name. The project name
should exclude any <EM>.gpr/.rep</EM> suffix. Only the '/' character should be used as a
directory separator. In addition, when running on Windows, the directory path should
include its drive designation preceded by a '/' (e.g., <CODE class=
"computeroutput">ghidra:/C:/mydir/myproject?/folderA/folderB</CODE>).</P>
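<P>As an illustration of the two formats, the following sketch assembles both kinds of URL
from their parts. The function names are hypothetical and the code is not part of Ghidra;
it only applies the formatting rules stated above.</P>

```python
def remote_ghidra_url(hostname, repository, port=None, folder_path=None):
    """Build a Ghidra Server repository URL (hypothetical helper)."""
    url = "ghidra://" + hostname
    # The port may be omitted when the server uses the default port.
    if port is not None:
        url += ":" + str(port)
    url += "/" + repository
    if folder_path:
        url += "/" + folder_path.strip("/")
    return url


def local_ghidra_url(directory_path, project_name, folder_path=None):
    """Build a local Ghidra Project URL (hypothetical helper)."""
    # Only '/' may be used as a directory separator.
    directory = directory_path.replace("\\", "/").rstrip("/")
    # A Windows drive designation must be preceded by '/'.
    if directory and not directory.startswith("/"):
        directory = "/" + directory
    # The project name excludes any .gpr/.rep suffix.
    for suffix in (".gpr", ".rep"):
        if project_name.endswith(suffix):
            project_name = project_name[:-len(suffix)]
    url = "ghidra:" + directory + "/" + project_name
    if folder_path:
        url += "?/" + folder_path.strip("/")
    return url
```

<P>Note that the local form never contains "://", and that the folder path is introduced
by "?/" rather than by a plain '/'.</P>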
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title" style="clear: both"><A name="BSimURLs"></A>BSim Database URLs</H3>
</DIV>
</DIV>
</DIV>
<P>BSim command-line tools utilize a URL to specify the type and specific details required
to establish a connection to a specific BSim Database. Within the Ghidra GUI, the database
details are not specified with a URL; they are entered using an interactive form. Each BSim
database type has a distinct URL format:</P>
<DIV class="informalexample">
<TABLE border="0" cellpadding="2" class="simplelist">
<TR>
<TH>Database Type</TH>
<TH align="left">URL Format</TH>
</TR>
<TR>
<TD>PostgreSQL</TD>
<TD><CODE class=
"computeroutput">postgresql://&lt;hostname&gt;[:&lt;port&gt;]/&lt;dbname&gt;</CODE></TD>
</TR>
<TR>
<TD>Elasticsearch</TD>
<TD><CODE class=
"computeroutput">https://&lt;hostname&gt;[:&lt;port&gt;]/&lt;dbname&gt;</CODE></TD>
</TR>
<TR>
<TD>Elasticsearch</TD>
<TD><CODE class=
"computeroutput">elastic://&lt;hostname&gt;[:&lt;port&gt;]/&lt;dbname&gt;</CODE></TD>
</TR>
<TR>
<TD>H2 File</TD>
<TD><CODE class=
"computeroutput">file:[/&lt;directory_path&gt;]/&lt;dbname&gt;</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>The <EM>https</EM> and <EM>elastic</EM> protocols are equivalent.</P>
<P>For local <EM>file</EM> URLs, the absolute path to the H2 database <EM>*.mv.db</EM> file
must be specified without the <EM>*.mv.db</EM> extension. Only the '/' character should be
used as a directory separator. In addition, when running on Windows, the directory path
should include its drive designation preceded by a '/' (e.g., <CODE class=
"computeroutput">file:/C:/mydir/mydb</CODE>).</P>
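<P>The table above can be summarized by a short sketch that produces a URL for each
database type. The function name and its parameters are hypothetical; the code merely
follows the formats listed in the table and is not taken from the BSim implementation.</P>

```python
def bsim_url(protocol, dbname, hostname=None, port=None, directory_path=None):
    """Build a BSim Database URL (hypothetical helper).

    protocol is one of "postgresql", "https", "elastic" or "file".
    """
    if protocol == "file":
        # Local H2 database: no "//", '/' separators only, no .mv.db extension.
        directory = (directory_path or "").replace("\\", "/").rstrip("/")
        if directory and not directory.startswith("/"):
            directory = "/" + directory
        if dbname.endswith(".mv.db"):
            dbname = dbname[:-len(".mv.db")]
        return "file:" + directory + "/" + dbname
    if protocol not in ("postgresql", "https", "elastic"):
        raise ValueError("unknown BSim protocol: " + protocol)
    if not hostname:
        raise ValueError("a hostname is required for server databases")
    url = protocol + "://" + hostname
    # The port is optional for server databases.
    if port is not None:
        url += ":" + str(port)
    return url + "/" + dbname
```

<P>For example, <CODE class="computeroutput">bsim_url("file", "mydb.mv.db",
directory_path="C:\\mydir")</CODE> yields the Windows example shown above.</P>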
</DIV>
</DIV>
</BODY>
</HTML>

View File

@ -0,0 +1,993 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<HTML>
<HEAD>
<META name="generator" content=
"HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<TITLE>Database Configuration</TITLE>
<LINK rel="stylesheet" type="text/css" href="help/shared/DefaultStyle.css">
<LINK rel="stylesheet" type="text/css" href="../../shared/languages.css">
<META name="generator" content="DocBook XSL Stylesheets V1.79.1">
<LINK rel="home" href="index.html" title="BSim Database">
<LINK rel="up" href="index.html" title="BSim Database">
<LINK rel="prev" href="DatabaseOverview.html" title="BSim Database">
<LINK rel="next" href="IngestProcess.html" title="Ingesting Executables">
</HEAD>
<BODY>
<DIV class="chapter">
<DIV class="titlepage">
<DIV>
<DIV>
<H1 class="title"><A name="DatabaseConfiguration"></A>Database Configuration</H1>
</DIV>
</DIV>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="ConfigOverview"></A>Overview</H2>
</DIV>
</DIV>
</DIV>
<P>The server for the BSim Database is distinct from the traditional Ghidra server,
although for many use cases it is convenient to have both running and view the BSim server
as a loosely coupled extension to the base Ghidra Server. In terms of start-up, shutdown,
and configuration however, the two servers are completely separate.</P>
<P>There are two choices for deploying a shared server for the BSim Database: PostgreSQL or
Elasticsearch. In addition, a local file-based database may be employed which utilizes an
integrated H2 Database engine. This file-based database is intended for smaller datasets
and its use is limited to a single process.</P>
<P>PostgreSQL software, including the extension necessary for BSim signature indexing,
comes prepackaged with the Ghidra distribution. It runs on a single host and makes
efficient use of whatever CPU, memory, and disk resources are made available to it.
PostgreSQL is a highly robust and capable server that should perform well on minimally
configured workstations up to high-end production hardware.</P>
<P>An Elasticsearch BSim plug-in is included with the Ghidra distribution, but the core
server software must be obtained separately by the database administrator. Elasticsearch is
a scalable text search and analytics database. It automatically distributes itself across
machines in a cluster, allowing individual database queries and requests to be serviced in
parallel. Support for BSim in Elasticsearch should still be considered a prototype, but
all major functionality has been implemented, and the BSim schema takes full advantage of
Elasticsearch as a distributed database.</P>
<P>BSim clients included in the base Ghidra distribution can interface to any of these
databases.</P>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="ServerConfig"></A>Server
Configuration</H2>
</DIV>
</DIV>
</DIV>
<DIV class="sect2">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title"><A name="PostConfig"></A>PostgreSQL Configuration</H3>
</DIV>
</DIV>
</DIV>
<P>The base Ghidra distribution comes with the PostgreSQL software and the extensions
necessary for supporting a BSim database. The PostgreSQL server is most easily managed
using the <SPAN class="bold"><STRONG>bsim_ctl</STRONG></SPAN> command-line script. When
<SPAN class="bold"><STRONG>bsim_ctl start</STRONG></SPAN> is run for the first time (see
below), the PostgreSQL software is unpacked, depending on the host OS, to either</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/Ghidra/Features/BSim/os/linux64/postgresql
OR</CODE></TD>
</TR>
<TR>
<TD><CODE class=
"computeroutput">$(ROOT)/Ghidra/Features/BSim/os/osx64/postgresql</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>BSim will not operate with PostgreSQL without the Ghidra-specific extensions, but
otherwise the provided installation is standard. It can be configured just like any other
stand-alone PostgreSQL server. PostgreSQL is highly configurable, and there are no direct
restrictions on modifying the configuration values. A default configuration is provided
with this installation that has been tuned specifically for the BSim Database
application, so in practice there may be little reason to modify it. But there are a few
standard configuration values for the server that might need adjusting. These do impact
important aspects of the server, like the amount of memory allocated to the server and
access restrictions.</P>
<DIV class="sect3">
<DIV class="titlepage">
<DIV>
<DIV>
<H4 class="title"><A name="PostStartStop"></A>Starting and Stopping the
Server</H4>
</DIV>
</DIV>
</DIV>
<P>The basic start-up and shut-down is accomplished with the same command-line script,
which takes either the keyword <SPAN class="command"><STRONG>start</STRONG></SPAN> or
<SPAN class="command"><STRONG>stop</STRONG></SPAN> as the first parameter. The second
parameter must be an absolute path to the chosen data directory.</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim_ctl start
/path/to/datadir</CODE></TD>
</TR>
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim_ctl start /path/to/datadir
port=8000</CODE></TD>
</TR>
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim_ctl stop
/path/to/datadir</CODE></TD>
</TR>
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim_ctl stop /path/to/datadir
force</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>The data directory should already exist and should initially not contain any files.
The first time a server is started for a particular data directory, a large number of
configuration files and other sub-directories associated with the PostgreSQL server
will automatically be created. Upon subsequent restarts the existing configuration will
be reused.</P>
<P>The <SPAN class="bold"><STRONG>start</STRONG></SPAN> command can take an optional
<SPAN class="bold"><STRONG>port=</STRONG></SPAN> parameter. This can be used to specify
a non-standard port for the PostgreSQL server to listen on. In this case, any
subsequent reference to the BSim server, in the Ghidra client, or with the <SPAN class=
"command"><STRONG>bsim</STRONG></SPAN> command described below, must specify the port.
When using the <SPAN class="command"><STRONG>bsim</STRONG></SPAN> command, a
non-default port must be explicitly specified with the BSim <SPAN class=
"command"><STRONG>postgresql://</STRONG></SPAN> URL (see <A class="xref" href=
"CommandLineReference.html#URLs">&ldquo;Ghidra and BSim URLs&rdquo;</A> for more
details).</P>
<P>The <SPAN class="command"><STRONG>stop</STRONG></SPAN> command can take the keyword
<SPAN class="command"><STRONG>force</STRONG></SPAN> as an optional parameter. Without
this, the shutdown of the server will wait until all currently connected clients finish
their sessions. Adding this parameter will cause all clients to be disconnected
immediately, rolling back any transactions, and the server will shut down
immediately.</P>
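<P>For scripted deployments, the invocations above can be assembled programmatically. The
sketch below only builds the argument list and does not run anything; the function name and
the <CODE class="computeroutput">ghidra_root</CODE> parameter are hypothetical, and the
absolute-path check assumes a POSIX-style host.</P>

```python
import posixpath


def bsim_ctl_command(ghidra_root, action, datadir, port=None, force=False):
    """Return the argument list for a bsim_ctl start/stop call.

    Hypothetical helper; it does not execute the command.
    """
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    # The data directory must be given as an absolute path.
    if not posixpath.isabs(datadir):
        raise ValueError("the data directory must be an absolute path")
    command = [ghidra_root + "/support/bsim_ctl", action, datadir]
    # port= is an optional parameter of start only.
    if action == "start" and port is not None:
        command.append("port=" + str(port))
    # force is an optional parameter of stop only.
    if action == "stop" and force:
        command.append("force")
    return command
```

<P>The resulting list could be handed to a process launcher such as Python's <CODE class=
"computeroutput">subprocess.run</CODE>; keeping construction separate from execution makes
the command easy to log or review first.</P>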
</DIV>
<DIV class="sect3">
<DIV class="titlepage">
<DIV>
<DIV>
<H4 class="title"><A name="PostSecurityAuthentication"></A>Security and
Authentication</H4>
</DIV>
</DIV>
</DIV>
<P>BSim makes use of PostgreSQL security mechanisms to enforce privileges and
authenticate users. The <SPAN class="command"><STRONG>bsim_ctl</STRONG></SPAN> command
wraps the subset of functionality described here, but other adjustments are possible by
connecting directly to the server and issuing SQL commands.</P>
<P>The PostgreSQL server, as configured for BSim, only accepts connections via SSL, so
communications in transit are always encrypted regardless of the authentication
settings.</P>
<P>PostgreSQL uses the concept of <SPAN class="emphasis"><EM>roles</EM></SPAN> to grant
access privileges based on particular users. Generally, a user's role is determined by
the <SPAN class="emphasis"><EM>username</EM></SPAN> used to establish the connection.
For BSim, each user role is granted one of two privilege levels: <SPAN class=
"command"><STRONG>user</STRONG></SPAN>, which allows read-only access to the server for
normal queries, and <SPAN class="command"><STRONG>admin</STRONG></SPAN>, which
additionally allows database creation, ingest, update, and deletion.</P>
<P>BSim supports three different authentication methods for connecting as a client or
during database ingest and maintenance. The method is established for a server by the
initial <SPAN class="command"><STRONG>start</STRONG></SPAN> command.</P>
<DIV class="informalexample">
<DIV class="variablelist">
<DL class="variablelist">
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>trust</STRONG></SPAN></SPAN></DT>
<DD>
<P><CODE class="computeroutput">bsim_ctl start /path/to/datadir
auth=trust</CODE></P>
<P>This is currently the default. No authentication is performed and privilege
is granted based on the user name presented. Masquerading is possible.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>password</STRONG></SPAN></SPAN></DT>
<DD>
<P><CODE class="computeroutput">bsim_ctl start /path/to/datadir
auth=password</CODE></P>
<P>Users are authenticated via password. A default password 'changeme' is
established when the new user is created. Passwords can be changed by the user
from the BSim client or can be reset by an administrator via the <SPAN class=
"command"><STRONG>resetpassword</STRONG></SPAN> command.</P>
</DD>
<DT><SPAN class="term"><SPAN class="bold"><STRONG>pki</STRONG></SPAN></SPAN></DT>
<DD>
<P><CODE class="computeroutput">bsim_ctl start /path/to/datadir auth=pki
ca=/path/to/rootcert</CODE></P>
<P>Users are authenticated by PKI certificates. Upon initialization, the BSim
server must be provided (via the <SPAN class=
"command"><STRONG>ca=</STRONG></SPAN> option) a file containing the public keys
for the certificate authorities used to issue users' certificates. The file
consists of the authoritative certificates in PEM format concatenated
together.</P>
<P>BSim users must register their certificate with the Ghidra client using the
<SPAN class="emphasis"><EM>Edit-&gt;Set PKI Certificate...</EM></SPAN> menu
option from the Project dialog. The BSim client will automatically submit the
certificate to a server that requests it, and the password to unlock it will be
requested as needed. This is the same mechanism used to access a PKI
protected Ghidra server, and if a user needs access to both a BSim server and
Ghidra server that are PKI protected, the servers should probably be configured
with the same certificate authorities so that they will accept the same
certificate from the user.</P>
<P>With PKI authentication enabled, at the time a new user role is established
with the server, the X.509 Distinguished Name, as bound to the user's
certificate, must be associated with the user name via the <SPAN class=
"command"><STRONG>dn=</STRONG></SPAN> option. See <A class="xref" href=
"#PostAddUser" title="Adding Users to the Database">&ldquo;Adding Users to the
Database&rdquo;</A>.</P>
</DD>
</DL>
</DIV>
</DIV>
<P>The authentication method should be established once, the first time the <SPAN
class="command"><STRONG>start</STRONG></SPAN> command is issued for the server on an
empty data directory. Subsequent restarts of the server will not change these settings.
If the settings really need to be changed, the <SPAN class=
"command"><STRONG>changeauth</STRONG></SPAN> command can be issued. It takes the same
options as the <SPAN class="command"><STRONG>start</STRONG></SPAN> command and can only
be run if the server is shut down first.</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim_ctl changeauth
/datadir/path auth=password</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>Using the <SPAN class="command"><STRONG>changeauth</STRONG></SPAN> command on a
server with an established set of users will likely require other disruptive changes to
create passwords or associate Distinguished Names with users, if they didn't exist
before.</P>
<P>If it is determined that only the database administrators have OS level, local,
access to the server's host machine, they can choose to use the <SPAN class=
"command"><STRONG>noLocalAuth</STRONG></SPAN> option as part of the <SPAN class=
"command"><STRONG>start</STRONG></SPAN> or <SPAN class=
"command"><STRONG>changeauth</STRONG></SPAN> commands. This disables authentication for
users connecting to the server by the 'localhost' interface. This may facilitate the
use of scripts for ingest etc., where working with passwords is cumbersome.
Authentication is still enforced for any remote connection.</P>
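<P>The relationship between the <SPAN class="command"><STRONG>auth=</STRONG></SPAN> and
<SPAN class="command"><STRONG>ca=</STRONG></SPAN> options described in this section can be
sketched as a small validation step. The function name is hypothetical and the code is an
illustration only; it covers just these two options.</P>

```python
def auth_options(auth="trust", ca=None):
    """Return the auth-related options for bsim_ctl start or changeauth.

    Hypothetical helper that mirrors the three documented methods.
    """
    if auth not in ("trust", "password", "pki"):
        raise ValueError("auth must be 'trust', 'password' or 'pki'")
    options = ["auth=" + auth]
    if auth == "pki":
        # PKI requires a file of certificate authority certificates.
        if ca is None:
            raise ValueError("auth=pki requires ca=/path/to/rootcert")
        options.append("ca=" + ca)
    return options
```

<P>Rejecting a PKI request without a certificate authority file before the server is
started avoids initializing a data directory with an unusable configuration.</P>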
</DIV>
<DIV class="sect3">
<DIV class="titlepage">
<DIV>
<DIV>
<H4 class="title"><A name="PostAddUser"></A>Adding Users to the Database</H4>
</DIV>
</DIV>
</DIV>
<P>The username used to start the server for the first time, causing the initialization
of the data directory, becomes the administrator for that server. No other
username/role is initially known to the server. New usernames/roles can be added to the
server using the following command:</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim_ctl adduser <SPAN class=
"emphasis"><EM>username</EM></SPAN></CODE></TD>
</TR>
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim_ctl adduser <SPAN class=
"emphasis"><EM>username</EM></SPAN> dn="C=US,ST=MD,CN=Firstname User"</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>If password authentication has been set for the server, the new user's password will
initially be set to 'changeme'. If PKI authentication has been set for the server, the
Distinguished Name, as bound to the new user's certificate, must be provided when
issuing the <SPAN class="command"><STRONG>adduser</STRONG></SPAN> command, via the
<SPAN class="command"><STRONG>dn=</STRONG></SPAN> option. The Distinguished Name must
be presented as a string containing a comma separated sequence of attribute/value pairs
that uniquely identifies a certificate. Currently, the Common Name (CN=) is the only
attribute inspected by the PostgreSQL server, so other attributes can be omitted.</P>
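<P>To make the expected shape of the Distinguished Name concrete, the following sketch
splits such a string into its attribute/value pairs and extracts the Common Name. It is a
deliberately naive illustration with a hypothetical function name: it assumes that values
contain no escaped commas, which a full X.509 name parser would have to handle.</P>

```python
def parse_distinguished_name(dn):
    """Split a comma separated DN string into a dict (naive sketch)."""
    attributes = {}
    for pair in dn.split(","):
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError("malformed attribute/value pair: " + repr(pair))
        # Attribute names are normalized to upper case, e.g. "cn" becomes "CN".
        attributes[key.strip().upper()] = value.strip()
    return attributes


def common_name(dn):
    """Return the CN attribute, the one the text says the server inspects."""
    attributes = parse_distinguished_name(dn)
    if "CN" not in attributes:
        raise ValueError("the Distinguished Name has no CN attribute")
    return attributes["CN"]
```

<P>Applied to the example above, <CODE class="computeroutput">common_name("C=US,ST=MD,CN=Firstname
User")</CODE> returns the name that must match the user's certificate.</P>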
<P>New users are by default only given <SPAN class=
"command"><STRONG>user</STRONG></SPAN> permissions, meaning that they can only place
queries to the database and cannot ingest, update, or delete data. The new user can be
given <SPAN class="command"><STRONG>admin</STRONG></SPAN> privileges (by an existing
administrator) by issuing the command:</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim_ctl changeprivilege <SPAN
class="emphasis"><EM>username</EM></SPAN> admin</CODE></TD>
</TR>
</TABLE>
</DIV>
</DIV>
<DIV class="sect3">
<DIV class="titlepage">
<DIV>
<DIV>
<H4 class="title"><A name="PostAdditionalConfig"></A>Additional
Configuration</H4>
</DIV>
</DIV>
</DIV>
<P>The relevant configuration files are at the top level of the data directory:</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">postgresql.conf</CODE></TD>
</TR>
<TR>
<TD><CODE class="computeroutput">pg_hba.conf</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>The most important configuration parameters in <CODE class=
"filename">postgresql.conf</CODE> are:</P>
<DIV class="informalexample">
<DIV class="variablelist">
<DL class="variablelist">
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>shared_buffers</STRONG></SPAN></SPAN></DT>
<DD>
<P>This controls the amount of RAM available for caching database pages across
all connections to the server. The default should be reasonable in most
situations, but for large databases or many simultaneous connections it might
make sense to increase this.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>max_wal_size</STRONG></SPAN>,</SPAN> <SPAN class="term"><SPAN
class="bold"><STRONG>checkpoint_timeout</STRONG></SPAN></SPAN></DT>
<DD>
<P>These control how often the server forces database pages to be written back
out to the file-system. The defaults are set to minimize disk writes when
ingesting large numbers of records in one session. There should be little
reason to change these values.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>ssl_ciphers</STRONG></SPAN></SPAN></DT>
<DD>
<P>This controls which ciphers the server allows when negotiating a connection.
The defaults are reasonable, but administrators may want more control. The
setting 'TLSv1.2', for instance, can be used to be compliant with the latest
TLS standard.</P>
</DD>
</DL>
</DIV>
</DIV>
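<P>Before changing any of these parameters it can help to see what a data directory
currently uses. The sketch below reads selected settings from the text of a <CODE class=
"filename">postgresql.conf</CODE> file. It is a simplified illustration with a hypothetical
function name: it assumes the common <CODE class="computeroutput">name = value</CODE> form
and does not handle '#' characters inside quoted values.</P>

```python
def read_settings(conf_text, names):
    """Return the values of the named settings found in conf_text."""
    wanted = set(names)
    found = {}
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        if name in wanted:
            # Later assignments override earlier ones; strip simple quotes.
            found[name] = value.strip().strip("'")
    return found
```

<P>A caller would pass the contents of the file at the top level of the data directory
together with the parameter names of interest, for example <CODE class=
"computeroutput">shared_buffers</CODE> and <CODE class=
"computeroutput">max_wal_size</CODE>.</P>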
<P>The <CODE class="filename">pg_hba.conf</CODE> file is used to configure which
connections the server accepts for a particular outward facing IP address and what
security mechanisms are enforced for those connections. Currently all addresses are
configured to accept SSL connections only, except possibly for 'localhost'.
Administrators <SPAN class="emphasis"><EM>can</EM></SPAN> currently filter connections
based on usernames and the particular database (which corresponds to Ghidra's concept
of <SPAN class="emphasis"><EM>repository</EM></SPAN>).</P>
<DIV class="warning" style="margin-left: 0.5in; margin-right: 0.5in;">
<H3 class="title">Warning</H3>
<P>By default, the server accepts all connections from all users.</P>
</DIV>
</DIV>
<DIV class="sect3">
<DIV class="titlepage">
<DIV>
<DIV>
<H4 class="title"><A name="ConfigDefaults"></A>Configuration Defaults</H4>
</DIV>
</DIV>
</DIV>
<P>There is a <CODE class="filename">serverconfig.xml</CODE> which contains a few of
the default configuration values that are most crucial for the BSim Database. <SPAN
class="bold"><STRONG>Beware:</STRONG></SPAN> This file is currently parsed only once
for the entire <SPAN class="emphasis"><EM>lifetime</EM></SPAN> of a particular data
directory: it is read only when the data directory is first initialized, i.e. the first
time the <SPAN class="command"><STRONG>bsim_ctl start</STRONG></SPAN> command is
invoked on the empty directory. This file is intended to provide reasonable defaults
that are different from the standard PostgreSQL defaults. To provide site-specific
configuration, changes should be made to the normal PostgreSQL configuration files.</P>
</DIV>
</DIV>
<DIV class="sect2">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title"><A name="ElasticConfig"></A>Elasticsearch Configuration</H3>
</DIV>
</DIV>
</DIV>
<P>A full description of how to configure an Elasticsearch cluster, including how to
start and stop the server, is beyond the scope of this document. In particular, the <SPAN
class="command"><STRONG>bsim_ctl</STRONG></SPAN> command-line, as described in <A class=
"xref" href="DatabaseConfiguration.html#PostConfig" title=
"PostgreSQL Configuration">&ldquo;PostgreSQL Configuration&rdquo;</A>, does not apply to
Elasticsearch. Complete documentation is available on-line from the Elasticsearch
website.</P>
<DIV class="sect3">
<DIV class="titlepage">
<DIV>
<DIV>
<H4 class="title"><A name="ElasticInstall"></A>Installing the Plug-in</H4>
</DIV>
</DIV>
</DIV>
<P>In order to make use of Elasticsearch with BSim, the database administrator must
install the <SPAN class="emphasis"><EM>lsh.zip</EM></SPAN> plug-in as part of the
Elasticsearch deployment. The plug-in is available in the Ghidra add-on named <SPAN
class="emphasis"><EM>BSimElasticPlugin</EM></SPAN>, which unpacks into a standard
Ghidra installation. The file <SPAN class="emphasis"><EM>lsh.zip</EM></SPAN> is a
standard Elasticsearch plug-in that must be installed on every node of the cluster
before a BSim repository can be created. The Elasticsearch distribution typically comes
preconfigured for a single node deployment. The description below shows how to enable
BSim on such a toy deployment, but this will need to be extended to support an entire
cluster.</P>
<P>Assuming the add-on has been unpacked, the plug-in can be installed to a single node
using the <SPAN class="emphasis"><EM>elasticsearch-plugin</EM></SPAN> command in the
<SPAN class="emphasis"><EM>bin</EM></SPAN> directory of the node's Elasticsearch
installation.</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">bin/elasticsearch-plugin install
file:///path/to/ghidra/Ghidra/contrib/BSimElasticPlugin/data/lsh.zip</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>Replace the initial portion of the absolute path in the URL to point to the Ghidra
installation. Once the plug-in is installed, the toy deployment can be (re)started from
the command-line by running</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">bin/elasticsearch</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>This will dump logging messages to the console, and you should see <CODE class=
"computeroutput">[lsh]</CODE> listed among the loaded plug-ins as the node starts
up.</P>
</DIV>
<DIV class="sect3">
<DIV class="titlepage">
<DIV>
<DIV>
<H4 class="title"><A name="ElasticURL"></A>The Elasticsearch URL</H4>
</DIV>
</DIV>
</DIV>
<P>Assuming an Elasticsearch cluster is running and the plug-in has been properly
installed, all other parts of BSim interact transparently with the cluster. The <SPAN
class="command"><STRONG>bsim</STRONG></SPAN> command, described in <A class="xref"
href="IngestProcess.html" title="Ingesting Executables"><I>Ingesting
Executables</I></A>, and the Ghidra/BSim client, described in <A class="xref" href=
"../BSimSearchPlugin/BSimSearch.html" title="Querying a BSim Database"><I>Querying a BSim
Database</I></A>, require no additional configuration to work with Elasticsearch,
except users must provide the correct URL to establish a connection. Elasticsearch
communicates over <SPAN class="emphasis"><EM>https</EM></SPAN>, and BSim clients
automatically assume they are communicating with Elasticsearch when they see this
protocol. Alternatively, the protocol may be specified as <SPAN class=
"emphasis"><EM>elastic</EM></SPAN> when using the <SPAN class=
"command"><STRONG>bsim</STRONG></SPAN> command. BSim assumes the default Elasticsearch
port of 9200 unless another port is given along with the server host. See <A
class="xref" href="CommandLineReference.html#URLs">&ldquo;Ghidra and BSim
URLs&rdquo;</A> for additional information about URLs.</P>
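<P>The protocol and default-port rules above can be sketched as follows. The URL layout
<CODE class="computeroutput">protocol://host[:port]/name</CODE>, the host name, and the
helper <CODE class="computeroutput">elastic_endpoint</CODE> are assumptions used for
illustration; the referenced URL section remains the authoritative description.</P>

```python
from urllib.parse import urlsplit

# Default Elasticsearch port assumed by BSim, per the paragraph above.
DEFAULT_ELASTIC_PORT = 9200
# Protocol names that the paragraph above associates with Elasticsearch.
ELASTIC_SCHEMES = ("https", "elastic")


def elastic_endpoint(bsim_url: str):
    """Split a URL into (host, port, database name), applying the default port."""
    parts = urlsplit(bsim_url)
    if parts.scheme not in ELASTIC_SCHEMES:
        raise ValueError("not an Elasticsearch-style URL: " + bsim_url)
    port = parts.port if parts.port is not None else DEFAULT_ELASTIC_PORT
    return parts.hostname, port, parts.path.lstrip("/")


if __name__ == "__main__":
    # Hypothetical host and repository names.
    print(elastic_endpoint("elastic://es.example.com/myrepo"))
    print(elastic_endpoint("https://es.example.com:9201/myrepo"))
```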
</DIV>
</DIV>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="CreateDatabase"></A>Creating a
Database</H2>
</DIV>
</DIV>
</DIV>
<P>If using either PostgreSQL or Elasticsearch, the server must be properly configured and
running before a <SPAN class="bold"><STRONG>database</STRONG></SPAN> can be created. In the
case of an H2 file-based database there is no server requirement. Only after a database has
been created can data be ingested or queries performed. In this context, a database is a
single container of reverse engineered functions. Metadata pertaining to executables and
call-graph relationships is also stored, but the principal database record describes a
<SPAN class="emphasis"><EM>function</EM></SPAN>. A single PostgreSQL or Elasticsearch
server can hold multiple independent databases.</P>
<P>A database is created using the <SPAN class="command"><STRONG>bsim</STRONG></SPAN>
command script. The basic command looks like</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim createdatabase <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> <SPAN class=
"emphasis"><EM>config_template</EM></SPAN></CODE></TD>
</TR>
</TABLE>
</DIV>
<P>A BSim database is completely distinct from the Ghidra Server or Ghidra project, so the
executables and functions contained within do not need to coincide at all.</P>
<P>The Ghidra GUI client specifies a BSim database with its explicit characteristics (i.e.,
DB type, name, host/port if applicable, etc.), while the <SPAN class=
"command"><STRONG>bsim</STRONG></SPAN> command accepts a <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> which includes similar details (see <A class="xref"
href="CommandLineReference.html#URLs">&ldquo;Ghidra and BSim URLs&rdquo;</A> for more
details).</P>
<P>The <SPAN class="emphasis"><EM>config_template</EM></SPAN> parameter passed to <SPAN
class="command"><STRONG>bsim createdatabase</STRONG></SPAN> names a collection of specific
configuration values for the newly created database. A standard Ghidra distribution
provides a number of predefined templates (See below) designed for specific database use
cases. It is simplest to use a predefined template when creating a database, but it is
possible to edit an existing template or create a new template (See <A class="xref" href=
"DatabaseConfiguration.html#DatabaseTemplates" title=
"Creating Database Templates">&ldquo;Creating Database Templates&rdquo;</A>).</P>
<P>There are two critical database properties being determined by the template that need to
be kept in mind: the <SPAN class="bold"><STRONG>index tuning</STRONG></SPAN> and the <SPAN
class="bold"><STRONG>weighting scheme</STRONG></SPAN> relative to the size of the database.
The two pieces of the template name, separated by the '_' character, refer to these
concerns.</P>
<P>The index tuning affects the use of the database by trading off between the time
required to perform individual queries, the amount of variation between matching functions
a query can tolerate, and the amount of storage required per database record. Ideally, the
database is tuned, before the initial ingest occurs, to the <SPAN class=
"emphasis"><EM>anticipated size</EM></SPAN> of the database. The database can trade off
storage size (per record) and latency for overall query response time, but the decision
needs to be made before the database is populated. Currently there is a <SPAN class=
"bold"><STRONG>medium</STRONG></SPAN> tuning that is ideal for repositories that will store
on the order of 10 million functions. There is also a <SPAN class=
"bold"><STRONG>large</STRONG></SPAN> tuning, which uses more storage but can maintain fast
query times for databases with 100 million functions or more. There is a large overlap for
these tunings, so if it is unclear how large a database might grow, go ahead and use the
medium tuning.</P>
<P>The weighting scheme affects how BSim views the relative importance of individual code
constructs within a function. Code constructions are extracted as <SPAN class=
"emphasis"><EM>features</EM></SPAN>, and each feature is assigned a weight. The basic
schemes are: <SPAN class="bold"><STRONG>32</STRONG></SPAN> for 32-bit compiled code, <SPAN
class="bold"><STRONG>64</STRONG></SPAN> for 64-bit code. The scheme that matches the
predominant form of code in the repository being ingested should be used. Mixed schemes are
possible, but a corpus which is predominantly 32-bit, even with a small number of 64-bit
executables mixed in, should still use the 32-bit weights.</P>
<P>There are some weighting schemes designed for more specialized code. The <SPAN class=
"bold"><STRONG>64_32</STRONG></SPAN> scheme is for 64-bit code using 32-bit pointers. The
<SPAN class="bold"><STRONG>nosize</STRONG></SPAN> scheme allows better matching of 32-bit
functions to 64-bit functions, when they are compiled from the same source. The <SPAN
class="bold"><STRONG>cpool</STRONG></SPAN> scheme is designed for Java byte-code or Dalvik
executables. For more discussion of weighting, see <A class="xref" href=
"FeatureWeight.html#WeightingSoftware" title="Weighting Software Features">&ldquo;Weighting
Software Features&rdquo;</A>.</P>
<P>The full template name incorporates both an index tuning and a weight scheme. Some
common examples of template names:</P>
<DIV class="informalexample">
<DIV class="variablelist">
<DL class="variablelist">
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>medium_32</STRONG></SPAN></SPAN></DT>
<DD>
<P>A medium index tuning with a weighting scheme designed for 32-bit
executables.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>medium_64</STRONG></SPAN></SPAN></DT>
<DD>
<P>A medium index tuning with a weighting scheme designed for 64-bit
executables.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>large_32</STRONG></SPAN></SPAN></DT>
<DD>
<P>A 32-bit weighting scheme with tuning for a large database size.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>medium_cpool</STRONG></SPAN></SPAN></DT>
<DD>
<P>A medium index tuning with a weighting scheme for Java executables.</P>
</DD>
<DT><SPAN class="term"><SPAN class=
"bold"><STRONG>medium_nosize</STRONG></SPAN></SPAN></DT>
<DD>
<P>A medium index tuning with a weighting scheme allowing matches between 32-bit
and 64-bit code.</P>
</DD>
</DL>
</DIV>
</DIV>
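<P>The naming convention can be sketched as a small helper that joins an index tuning and
a weighting scheme. The function names and the exact cut-over of 100 million functions are
assumptions made for illustration, and not every combination the helper can produce
necessarily ships as a template file, so the result should be checked against the
templates actually provided with the distribution.</P>

```python
# Index tunings and weighting schemes named in the text above.
TUNINGS = ("medium", "large")
SCHEMES = ("32", "64", "64_32", "nosize", "cpool")

# Assumed cut-over: the text recommends "large" for 100 million functions or more.
LARGE_THRESHOLD = 100_000_000


def choose_tuning(expected_functions: int) -> str:
    """Pick an index tuning from the anticipated number of functions."""
    return "large" if expected_functions >= LARGE_THRESHOLD else "medium"


def template_name(tuning: str, scheme: str) -> str:
    """Join an index tuning and a weighting scheme into a template name."""
    if tuning not in TUNINGS:
        raise ValueError("unknown index tuning: " + tuning)
    if scheme not in SCHEMES:
        raise ValueError("unknown weighting scheme: " + scheme)
    return tuning + "_" + scheme


if __name__ == "__main__":
    # A corpus of roughly 10 million 32-bit functions.
    print(template_name(choose_tuning(10_000_000), "32"))
```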
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="TailorBSim"></A>Tailoring BSim
Metadata</H2>
</DIV>
</DIV>
</DIV>
<P>There is some facility to tailor a specific BSim database instance so that it can ingest
and/or report information about executables or functions to make results more useful for a
specific project or user. Capabilities can be added after a database has been created and
is running by issuing specific <SPAN class="command"><STRONG>bsim</STRONG></SPAN> commands,
but they can also be added to a <SPAN class="emphasis"><EM>configuration
template</EM></SPAN> prior to creating the database, which provides a record of the
specific additions should the database instance need to be recreated or multiple tailored
instances be deployed. For additions that allow the ingest of more metadata about
executables or functions, users must provide additional scripts to Ghidra during the ingest
process in order to read in or discover the new metadata.</P>
<P>The <SPAN class="bold"><STRONG>Name</STRONG></SPAN>, <SPAN class=
"bold"><STRONG>Owner</STRONG></SPAN>, and <SPAN class=
"bold"><STRONG>Description</STRONG></SPAN> associated with a BSim instance can be trivially
tailored with the <SPAN class="command"><STRONG>bsim setmetadata</STRONG></SPAN>
command.</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim setmetadata <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> "name=BSim Database"</CODE></TD>
</TR>
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim setmetadata <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> "owner=Administrators"</CODE></TD>
</TR>
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim setmetadata <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> "description=Files of interest"</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>This information is displayed in various windows by the BSim client. The values can be
changed at any time and do not otherwise affect the records contained in the database.
Multiple command-line parameters can be fed to <SPAN class="command"><STRONG>bsim
setmetadata</STRONG></SPAN> so long as each one starts with <SPAN class=
"bold"><STRONG>name=</STRONG></SPAN>, <SPAN class="bold"><STRONG>owner=</STRONG></SPAN>, or
<SPAN class="bold"><STRONG>description=</STRONG></SPAN> respectively. Quoting may be
necessary to get some strings to be interpreted as a single command-line parameter.</P>
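<P>A minimal sketch of assembling one <SPAN class="command"><STRONG>bsim
setmetadata</STRONG></SPAN> invocation with several parameters follows. The install path,
the example URL, and the helper <CODE class="computeroutput">setmetadata_argv</CODE> are
hypothetical; the sketch only illustrates that each <CODE class=
"computeroutput">key=value</CODE> pair must reach the command as a single argument.</P>

```python
import shlex

# Prefixes accepted by "bsim setmetadata", per the paragraph above.
ALLOWED_KEYS = ("name", "owner", "description")


def setmetadata_argv(bsim_script: str, bsim_url: str, **fields: str) -> list:
    """Build the argument list for one 'bsim setmetadata' invocation."""
    argv = [bsim_script, "setmetadata", bsim_url]
    for key, value in fields.items():
        if key not in ALLOWED_KEYS:
            raise ValueError("unsupported metadata key: " + key)
        # Each key=value pair is a single argument, even if it contains spaces.
        argv.append(key + "=" + value)
    return argv


if __name__ == "__main__":
    argv = setmetadata_argv(
        "/opt/ghidra/support/bsim",           # hypothetical install path
        "postgresql://localhost/example_db",  # hypothetical bsimURL
        name="BSim Database",
        owner="Administrators",
    )
    # shlex.join shows how a POSIX shell would need the arguments quoted.
    print(shlex.join(argv))
```

<P>Passing the list directly to a process launcher, rather than a joined string to a
shell, sidesteps the quoting question entirely.</P>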
<DIV class="sect2">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title"><A name="ExeCat"></A>Executable Categories</H3>
</DIV>
</DIV>
</DIV>
<P>BSim provides the powerful ability to associate new types of metadata with each
executable that the database ingests. Any method of categorizing executables that
describes an executable with a simple string value, referred to here as an executable
<SPAN class="bold"><STRONG>category</STRONG></SPAN>, can be added as a field to the
database. With only minor adjustments to the ingest process, new category values can be
automatically attached to incoming executables and are treated like any other executable
field that BSim understands. Category values are retrieved with queries, can be used for
filtering, and show up as sortable columns in result tables.</P>
<P>All categories have a formal name (or type), which is used both in the ingest process
(See below) and as the label for table columns. The name can contain alphanumeric
characters or punctuation from the limited set, ' ._:/()'. For each executable there can
be zero, one, or more <SPAN class="emphasis"><EM>string</EM></SPAN> values associated
with the category. No value is required for the executable, and any single value can be
used for filtering (either the executable is labeled with the value or it is not) even if
there are multiple values for that category. If there are multiple values, a query that
matches the executable will return all the values as a single sorted column entry.</P>
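<P>The naming rule can be checked before a category is added, as in the sketch below. The
helper name is hypothetical, and two points are assumptions rather than documented
behavior: that "alphanumeric" means ASCII letters and digits, and that an empty name is
rejected.</P>

```python
# Punctuation permitted in a category name, per the paragraph above.
ALLOWED_PUNCTUATION = " ._:/()"


def is_valid_category_name(name: str) -> bool:
    """Check a proposed category name against the stated character set."""
    if not name:
        return False  # assumption: an empty name is not usable as a column label
    return all(
        (ch.isascii() and ch.isalnum()) or ch in ALLOWED_PUNCTUATION
        for ch in name
    )


if __name__ == "__main__":
    for candidate in ("Product Line (v2.0)", "Vendor/Family:ID", "bad-name"):
        print(candidate, is_valid_category_name(candidate))
```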
<P>It is also possible to create a special time-based category. This category can have
any name as above, but instead of associating string values with the executable, it
associates a single time-stamp. The time-stamp has precision down to the millisecond and
provides filtering and sorting based on time. Internally, this new category repurposes
the column storage originally providing an executable's <SPAN class="emphasis"><EM>Ingest
Date</EM></SPAN> field. As a result, any BSim instance
can have only one time category and only one time-stamp within it. The ingest scripting
must provide any actual time-stamp value for the executable, or the database will fill in
the "epoch", 12:00 am, Jan 1, 1970.</P>
<P>A new category can be added to the database at any time using the <SPAN class=
"command"><STRONG>bsim addexecategory</STRONG></SPAN> command.</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim addexecategory <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> MyCategoryName</CODE></TD>
</TR>
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim addexecategory <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> MyTimeField date</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>The single time-stamp field can be renamed by appending the keyword "date" to the
command after the name of the category. Once a category has been added, the corresponding
program options set for any new executables will automatically be read into the database as part of
the ingest process. Previously ingested executables, assuming they have the new program
options set, can be updated within the BSim database using one of the <SPAN class=
"command"><STRONG>bsim updaterepo</STRONG></SPAN> command variants. In either case, the
relevant program options typically need to be filled by running a Ghidra script (See <A
class="xref" href="IngestProcess.html#IngestExeCat" title=
"Ingesting Executable Categories">&ldquo;Ingesting Executable Categories&rdquo;</A>).
There is currently no method for deleting a category once it has been created.</P>
</DIV>
<DIV class="sect2">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title"><A name="FunctionTags"></A>Function Tags</H3>
</DIV>
</DIV>
</DIV>
<P>BSim can be configured to recognize specific <SPAN class="bold"><STRONG>Function
Tags</STRONG></SPAN>, which are named Boolean properties on individual functions within
an executable. Within a Ghidra program, any number of different function tags can be
established by the user and are used to label individual functions or specific subsets of
functions that share a particular property. This would typically be used to label classes
of functions that are important to analysts; unpacked functions could be labeled with the
tag <SPAN class="emphasis"><EM>UNPACKED</EM></SPAN>, for instance.</P>
<P>In order for BSim to recognize specific function tags, they must be individually
registered with the BSim database. These tags are then automatically ingested into the
database, along with the other standard metadata describing functions, and can be used to
filter match results when querying the database. A function tag has a formal name, which
can be displayed as part of the function header within the main code browser and is used
for BSim columns and filter labels. Once the tag is created for a program, functions
universally have the tag as a Boolean property: either the name applies to a function or
it doesn't. Arbitrary subsets of functions can then be <SPAN class="emphasis"><EM>tagged</EM></SPAN>
with that name.</P>
<P>A tag must be <SPAN class="emphasis"><EM>registered</EM></SPAN> with a BSim database
before it can be used as a filter or seen in results. A tag can be registered at any time
with the <SPAN class="command"><STRONG>bsim addfunctiontag</STRONG></SPAN> command.</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/support/bsim addfunctiontag <SPAN class=
"emphasis"><EM>bsimURL</EM></SPAN> MyTagName</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>The new tag will automatically be read in when any new executables are ingested. If
previously ingested executables already had the new tags before they were registered,
their metadata within the BSim database can be updated using the <SPAN class=
"command"><STRONG>bsim updaterepo</STRONG></SPAN> command variants. BSim is limited to 29
registered tag names, and there is currently no way to remove a tag once it has been
registered.</P>
</DIV>
<DIV class="sect2">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title"><A name="DatabaseTemplates"></A>Creating Database Templates</H3>
</DIV>
</DIV>
</DIV>
<P>It is possible to create tailored database configuration templates so that
implementors have a permanent and accessible record of a particular set-up and don't need
to repeatedly issue <SPAN class="command"><STRONG>bsim setmetadata</STRONG></SPAN> and
<SPAN class="command"><STRONG>bsim addexecategory</STRONG></SPAN> when creating a
database. Other aspects of a database can also be manipulated, like weighting schemes and
index tuning, but doing this properly is beyond the scope of this document. A <SPAN
class="bold"><STRONG>database template</STRONG></SPAN> is the basic set of configuration
parameters used to set up a BSim database instance. The configuration parameters are
established for a particular database when the <SPAN class="command"><STRONG>bsim
createdatabase</STRONG></SPAN> command is run (See <A class="xref" href=
"DatabaseConfiguration.html#CreateDatabase" title="Creating a Database">&ldquo;Creating a
Database&rdquo;</A>). The template name passed on the command-line actually identifies an
XML file-name, appended with the '.xml' suffix, in the directory:</P>
<DIV class="informalexample">
<TABLE border="0" summary="Simple list" class="simplelist">
<TR>
<TD><CODE class="computeroutput">$(ROOT)/Ghidra/Features/BSim/data</CODE></TD>
</TR>
</TABLE>
</DIV>
<P>The file has a root tag of <SPAN class="emphasis"><EM>&lt;dbconfig&gt;</EM></SPAN>,
and the first child tag of this root is the <SPAN class=
"emphasis"><EM>&lt;info&gt;</EM></SPAN> tag. This tag contains all the metadata tags that
can be easily changed or added to the database. A list of the metadata tags follows. The
metadata is provided as formal text content within the tag, and none of the tags
currently take attributes.</P>
<DIV class="informalexample">
<DIV class="table">
<TABLE width="80%" frame="none">
<COL width="30%">
<COL width="70%">
<THEAD>
<TR>
<TD><SPAN class="bold"><STRONG>XML Tag</STRONG></SPAN></TD>
<TD><SPAN class="bold"><STRONG>Description</STRONG></SPAN></TD>
</TR>
</THEAD>
<TBODY>
<TR>
<TD><CODE class="computeroutput">&lt;name&gt;</CODE></TD>
<TD>Name of the database</TD>
</TR>
<TR>
<TD><CODE class="computeroutput">&lt;owner&gt;</CODE></TD>
<TD>Owner of the database</TD>
</TR>
<TR>
<TD><CODE class="computeroutput">&lt;description&gt;</CODE></TD>
<TD>An overarching description of the database</TD>
</TR>
<TR>
<TD><CODE class="computeroutput">&lt;major&gt;</CODE></TD>
<TD>Major decompiler version used for ingest (Should be set to zero)</TD>
</TR>
<TR>
<TD><CODE class="computeroutput">&lt;minor&gt;</CODE></TD>
<TD>Minor decompiler version used for ingest (Should be set to zero)</TD>
</TR>
<TR>
<TD><CODE class="computeroutput">&lt;settings&gt;</CODE></TD>
<TD>Specific settings for the signature strategy (Should be set to zero)</TD>
</TR>
<TR>
<TD><CODE class="computeroutput">&lt;execategory&gt;</CODE></TD>
<TD>The name of an executable category (to be) defined for this database
instance</TD>
</TR>
<TR>
<TD><CODE class="computeroutput">&lt;datename&gt;</CODE></TD>
<TD>The name of the timestamp column</TD>
</TR>
<TR>
<TD><CODE class="computeroutput">&lt;functiontag&gt;</CODE></TD>
<TD>The name of a function tag (to be) registered with this database
instance</TD>
</TR>
</TBODY>
</TABLE>
</DIV>
</DIV>
<P>There can be multiple <SPAN class="emphasis"><EM>&lt;execategory&gt;</EM></SPAN> tags
if more than one category is desired and both <SPAN class=
"emphasis"><EM>&lt;execategory&gt;</EM></SPAN> and <SPAN class=
"emphasis"><EM>&lt;datename&gt;</EM></SPAN> are optional tags. The date column name
defaults to 'Ingest Date' and is drawn from the corresponding Ghidra program option. The
tag order needs to be preserved. There can be multiple <SPAN class=
"emphasis"><EM>&lt;functiontag&gt;</EM></SPAN> tags, one for each function tag to be
registered with the database.</P>
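<P>As a hedged sketch, the metadata portion of a template could be generated as shown
below. It assumes that the required tag order is the order in which the table above lists
the tags, and all names and values are hypothetical. It produces only the <SPAN class=
"emphasis"><EM>&lt;info&gt;</EM></SPAN> element; a complete template also carries the
index tuning and weights file tags, which this fragment does not create, so in practice
the result would be merged into a copy of an existing template.</P>

```python
import xml.etree.ElementTree as ET


def build_info(name, owner, description, categories=(), datename=None, tags=()):
    """Build a dbconfig root whose first child is an info element."""
    root = ET.Element("dbconfig")
    info = ET.SubElement(root, "info")
    # Tags are emitted in the order in which the table above lists them.
    ET.SubElement(info, "name").text = name
    ET.SubElement(info, "owner").text = owner
    ET.SubElement(info, "description").text = description
    for fixed in ("major", "minor", "settings"):
        ET.SubElement(info, fixed).text = "0"  # the table says these should be zero
    for category in categories:
        ET.SubElement(info, "execategory").text = category
    if datename is not None:
        ET.SubElement(info, "datename").text = datename
    for tag in tags:
        ET.SubElement(info, "functiontag").text = tag
    return root


if __name__ == "__main__":
    root = build_info(
        "Example Database", "Administrators", "Files of interest",
        categories=["Vendor"], datename="Release Date", tags=["UNPACKED"],
    )
    ET.indent(root)  # pretty-printing helper, available in Python 3.9+
    print(ET.tostring(root, encoding="unicode"))
```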
<P>It is easiest to copy an existing template and just edit the tags described above. The
remaining tags in the file are more dangerous to manipulate. The <SPAN class=
"emphasis"><EM>&lt;k&gt;</EM></SPAN> and <SPAN class="emphasis"><EM>&lt;L&gt;</EM></SPAN>
tags pertain to the index tuning. The <SPAN class=
"emphasis"><EM>&lt;weightsfile&gt;</EM></SPAN> tag gives the name of the weights file,
within the same directory, which is itself an XML file. It is simplest to choose from
the existing weight files provided with the distribution. See <A class="xref" href=
"FeatureWeight.html#WeightingSoftware" title=
"Weighting Software Features">&ldquo;Weighting Software Features&rdquo;</A>.</P>
</DIV>
</DIV>
</DIV>
</BODY>
</HTML>

View File

@ -0,0 +1,258 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<HTML>
<HEAD>
<META name="generator" content=
"HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net">
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<TITLE>Features and Weights</TITLE>
<LINK rel="stylesheet" type="text/css" href="help/shared/DefaultStyle.css">
<LINK rel="stylesheet" type="text/css" href="../../shared/languages.css">
<META name="generator" content="DocBook XSL Stylesheets V1.79.1">
<LINK rel="home" href="index.html" title="BSim Database">
<LINK rel="up" href="index.html" title="BSim Database">
<LINK rel="prev" href="DatabaseQuery.html" title="Querying a BSim Database">
<LINK rel="next" href="CommandLineReference.html" title="Command-Line Utility Reference">
</HEAD>
<BODY>
<DIV class="chapter">
<DIV class="titlepage">
<DIV>
<DIV>
<H1 class="title"><A name="FeatureWeight"></A>Features and Weights</H1>
</DIV>
</DIV>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="FunctionFeatures"></A>Features of
Software Functions</H2>
</DIV>
</DIV>
</DIV>
<P>The BSim Database uses a standard <SPAN class="bold"><STRONG>Feature
Vector</STRONG></SPAN> approach to compare and index software functions. A <SPAN class=
"bold"><STRONG>feature</STRONG></SPAN> is an abstraction that simply means a single element
or attribute that can be compared quantitatively between two objects. The set of possible
features used by a particular approach is fixed, and any object being examined is viewed as
some unordered subset of all the possible features. So features are the smallest (atomic)
aspect of an object that can be measured: either two objects share a feature in common or
they do not. But within this scheme, because objects generally consist of many individual
features, quantitative fine-grained comparisons can be automatically calculated.</P>
<P>The BSim Database extracts its features from the data-flow representation of a function,
after it has been normalized by the Ghidra decompiler. This is the SSA graph representation
of the function, with nodes representing the variables and operators of the function, and
the edges representing the read/write relationships between them. An individual feature is
just a portion of this graph, encompassing some subset of variables and operators and the
specific flow between them. Because of the decompilation, a feature can be viewed naturally
as a uniform snippet of C source code, a partial extraction of some expression in the
source code representation of the function. The full set of features provides uniform (and
overlapping) coverage of the graph representation of the entire function.</P>
<P>Features encode specific aspects of the variables they cover but not others. The size of
a variable, the operator that produced it, and the set of operators it is fed into are
encoded in the features. But, any name assigned to the variable, its data-type, or even its
storage location are <SPAN class="emphasis"><EM>not</EM></SPAN> encoded in the
features.</P>
<P>Within a function, details about the specific subfunctions that it calls are not encoded
in any of the features for that function, but information describing where the call is made
and the set of parameters it takes is encoded.</P>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="WeightingSoftware"></A>Weighting
Software Features</H2>
</DIV>
</DIV>
</DIV>
<P>Some features are more useful for identifying a specific function out of a large corpus
than others. With the view that features are just portions of recovered C expressions, some
C expressions are simply more common than others. The BSim Database compensates for these
differences by assigning a weight to each feature that factors in to the similarity and
confidence scores produced when comparing functions. Weighting schemes are considered a
configuration parameter of the database and are established for a particular database when
it is created. The scheme cannot be changed without creating an entirely new database and
reingesting the functions.</P>
<P>Ghidra comes with precomputed weighting schemes that are calculated using statistics
drawn from homogeneous collections of systems and application software. A feature's weight
is computed by counting the number of times it occurs across the entire corpus and
comparing this with the counts from other features. This allows a direct computation of the
information content of the feature; quantitatively, how much have we narrowed down a
particular function from the corpus when we know it contains a particular feature.</P>
<P>The two primary weighting schemes are called <SPAN class=
"bold"><STRONG>32</STRONG></SPAN> and <SPAN class="bold"><STRONG>64</STRONG></SPAN>, based
on 32-bit code and on 64-bit code respectively. This means that a particular database
instance has better sensitivity for either 32-bit or 64-bit functions. The quantitative
scores, similarity and confidence, will be more accurate at distinguishing pairs of
functions from one corpus. This does not mean that functions from the <SPAN class=
"emphasis"><EM>wrong</EM></SPAN> group cannot be ingested or queried, but the scores may
not be as accurate. There is also a <SPAN class="bold"><STRONG>64_32</STRONG></SPAN>
weighting scheme for architectures where code is compiled to use 64-bit registers but
addresses are still 32-bit.</P>
<P>The specialized weighting scheme <SPAN class="bold"><STRONG>nosize</STRONG></SPAN>
allows BSim to match between 32-bit and 64-bit implementations of a function. It works by
making feature hashes blind to the size difference between a 32-bit variable versus a
64-bit variable. This compensates for a compiler's tendency to assign a full 64-bit
register to a 32-bit variable, which is frequently difficult for the decompiler to
automatically resolve in the context of a single function. Because of this blindness, there
is a slight loss of sensitivity, when matching 32-bit to 32-bit functions, or when matching
64-bit to 64-bit, over the <SPAN class="bold"><STRONG>32</STRONG></SPAN> or <SPAN class=
"bold"><STRONG>64</STRONG></SPAN> schemes respectively.</P>
<P>The weighting scheme <SPAN class="bold"><STRONG>cpool</STRONG></SPAN> should be used for
run-time compilation (JIT) architectures, like Java Dalvik or <SPAN class=
"emphasis"><EM>.class</EM></SPAN> byte-code executables. These architectures use
characteristic <SPAN class="emphasis"><EM>constant pool</EM></SPAN> instructions that delay
exact decisions about code and data layout until runtime. The decompiler can still recover
data-flow effectively by treating these instructions as black-box operations, so BSim works
in the same way as with native code. But a specialized weighting scheme is needed to
balance BSim's sensitivity to these operations.</P>
</DIV>
<DIV class="section">
<DIV class="titlepage">
<DIV>
<DIV>
<H2 class="title" style="clear: both"><A name="CompareVectors"></A>Comparing Feature
Vectors</H2>
</DIV>
</DIV>
</DIV>
<P>For a particular function, the set of extracted features and their assigned weights make
up the formal <SPAN class="bold"><STRONG>feature vector</STRONG></SPAN> associated with the
function. When querying a BSim Database, the primary function search is performed by
comparing feature vectors. There are two formal scores that are computed on a pair of
feature vectors, <SPAN class="emphasis"><EM>similarity</EM></SPAN> and <SPAN class=
"emphasis"><EM>confidence</EM></SPAN>.</P>
<DIV class="sect2">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title"><A name="Similarity"></A>Similarity</H3>
</DIV>
</DIV>
</DIV>
<P>Similarity is a direct calculation of the percentage of features in common between two
functions. It varies continuously from 0.0, meaning the functions share no features at
all, to 1.0, meaning that the functions have the same feature set. Formally, similarity
is defined as the <SPAN class="emphasis"><EM>cosine similarity</EM></SPAN> of the two
feature vectors. Weights determine how important individual features are in the score
relative to other features, providing a practical and realistic meaning to the score. Two
functions can exhibit a few unimportant changes, but the similarity can still be very
high because the differences are likely not weighted heavily. Along the same lines, two
functions can share most of their features but have a low similarity because they differ
in more important features.</P>
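<P>The effect of weights on cosine similarity can be illustrated with a simplified model
in which a feature vector is a mapping from feature to weight. The feature names and
weights below are invented, and the sketch ignores any further detail of how BSim builds
its vectors; it only shows why a difference in a lightly weighted feature barely moves
the score while a difference in a heavily weighted feature lowers it sharply.</P>

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity of two vectors stored as {feature: weight} mappings."""
    dot = sum(weight * b[feature] for feature, weight in a.items() if feature in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an empty vector
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical feature names and weights, for illustration only.
    original = {"f1": 3.0, "f2": 3.0, "f3": 0.5}
    minor_change = {"f1": 3.0, "f2": 3.0, "f4": 0.5}  # differs in a light feature
    major_change = {"f1": 3.0, "f5": 3.0, "f3": 0.5}  # differs in a heavy feature
    print(cosine_similarity(original, minor_change))
    print(cosine_similarity(original, major_change))
```

<P>Both altered functions share two of three features with the original, yet only the
one that keeps the heavily weighted features stays close to 1.0.</P>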
<P>When searching for a function, the database sets a particular threshold on similarity,
0.7 by default, and returns functions whose similarity with the queried function exceeds
that threshold. This can produce <SPAN class="emphasis"><EM>false positive</EM></SPAN>
matches for small functions because a small function is described by just a few features
and it is then comparatively easy to randomly match a high percentage of these features.
Whether a false positive is likely can be decided quantitatively by examining the
<SPAN class="emphasis"><EM>confidence</EM></SPAN> score below.</P>
</DIV>
<DIV class="sect2">
<DIV class="titlepage">
<DIV>
<DIV>
<H3 class="title"><A name="Confidence"></A>Confidence</H3>
</DIV>
</DIV>
</DIV>
<P>Confidence is a log likelihood ratio, a weighted count of the set of features that
match between two functions minus the set of features that are different. It is an
open-ended score, and the higher it gets, the more likely it is that the two functions
are a true match. Fixing a threshold for the confidence score provides a more consistent
<SPAN class="emphasis"><EM>false positive</EM></SPAN> rate, as opposed to thresholding on
similarity. A higher score means the two functions have more features in common as an
absolute count, not just a higher percentage. So the chance of randomly matching most of
the features continues to go down as confidence goes up.</P>
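<P>The description above can be approximated by a deliberately simplified model: the sum
of the weights of shared features minus the sum of the weights of features found in only
one of the two functions. This is an assumption-laden sketch and not BSim's actual
formula, which may scale or normalize terms differently; the feature names and weights
are invented.</P>

```python
def confidence(a: dict, b: dict) -> float:
    """Weighted count of shared features minus weighted count of differing ones."""
    shared = set(a).intersection(b)
    differing = set(a).symmetric_difference(b)
    weights = dict(a)
    weights.update(b)  # a feature is assumed to carry the same weight in both
    matched = sum(weights[f] for f in shared)
    mismatched = sum(weights[f] for f in differing)
    return matched - mismatched


if __name__ == "__main__":
    # A tiny function compared with an identical copy: a perfect but low score.
    small = {"s1": 2.0, "s2": 2.0}
    print(confidence(small, small))

    # Two larger functions sharing 18 of 20 features each.
    big_a = {"f" + str(i): 2.0 for i in range(20)}
    big_b = {"f" + str(i): 2.0 for i in range(2, 22)}
    print(confidence(big_a, big_b))
```

<P>In this model the small pair matches perfectly yet scores far below the larger,
imperfectly matching pair, which mirrors the point that confidence tracks the absolute
amount of shared evidence.</P>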
<P>A general correspondence between low confidence scores and false positive rates can be
somewhat skewed by <SPAN class="emphasis"><EM>wrappers</EM></SPAN> and other small
functions, which are always common but whose specific frequency can vary depending on the
type of software. BSim fixes the score 10.0 for a particular wrapper form, providing a
convenient boundary between wrappers and more substantial functions where frequencies are
more consistent. For scores of 10.0 and greater, we get the following rough
correspondence with false positive rate. The rate drops by a factor of 2 for an increase
in confidence of between 4 and 5 points.</P>
<DIV class="informalexample">
<DIV class="table">
<A name="falsepositive.htmltable"></A>
<TABLE width="70%" frame="none">
<COL width="30%">
<COL width="70%">
<THEAD>
<TR>
<TD><SPAN class="bold"><STRONG>Confidence</STRONG></SPAN></TD>
<TD><SPAN class="bold"><STRONG>False Positive Rate
(Approximate)</STRONG></SPAN></TD>
</TR>
</THEAD>
<TBODY>
<TR>
<TD>10</TD>
<TD>1 in 4,000</TD>
</TR>
<TR>
<TD>26</TD>
<TD>1 in 100,000</TD>
</TR>
<TR>
<TD>43</TD>
<TD>1 in 1,000,000</TD>
</TR>
<TR>
<TD>93</TD>
<TD>1 in 1,000,000,000</TD>
</TR>
</TBODY>
</TABLE>
</DIV>
</DIV>
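<P>For confidence values between the tabulated rows, a rough estimate can be obtained by
interpolating the logarithm of the rate. The log-linear interpolation and the helper name
<CODE class="computeroutput">approx_one_in</CODE> are choices made for this sketch, not a
method taken from BSim, and the result is no more precise than the approximate table it
is built on.</P>

```python
import math

# (confidence, N) pairs from the table above: the rate is about 1 in N.
TABLE = [(10, 4_000), (26, 100_000), (43, 1_000_000), (93, 1_000_000_000)]


def approx_one_in(confidence: float) -> float:
    """Estimate N (a rate of about 1 in N) by interpolating log10(N) linearly."""
    lowest, highest = TABLE[0][0], TABLE[-1][0]
    if confidence > highest or lowest > confidence:
        raise ValueError("confidence lies outside the tabulated range")
    for (c0, n0), (c1, n1) in zip(TABLE, TABLE[1:]):
        if c1 >= confidence:
            t = (confidence - c0) / (c1 - c0)
            log_n = math.log10(n0) + t * (math.log10(n1) - math.log10(n0))
            return 10.0 ** log_n
    raise AssertionError("unreachable: the range check above covers all cases")


if __name__ == "__main__":
    for score in (10, 30, 68, 93):
        print(score, "about 1 in", round(approx_one_in(score)))
```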
<P>For a single function, there is an upper-bound to the confidence that can be achieved
by a possible match, its <SPAN class="emphasis"><EM>self significance</EM></SPAN>. This
upper-bound is of course reached by comparison with a function having 1.0 similarity.
Self significance is roughly proportional to the size of the function. So it is impossible
to achieve a high confidence for a small function, for single matches viewed in
isolation. Of course a medium to low confidence threshold may be enough to produce a
unique match if the database is small, and a medium to high confidence threshold may
still produce occasional false positives if the database is very large.</P>
</DIV>
</DIV>
</DIV>
</BODY>
</HTML>

Some files were not shown because too many files have changed in this diff