NT-Xent (Normalized Temperature-Scaled Cross-Entropy) Loss Explained and Implemented in PyTorch

An intuitive explanation of the NT-Xent loss, with a step-by-step explanation of the operation and our implementation in PyTorch. Co-authored with Naresh Singh.

Formula for NT-Xent loss. Source: Papers with Code (CC-BY-SA)

Introduction

Recent advances in self-supervised learning and contrastive learning have excited researchers and practitioners in Machine Learning (ML) to explore this space with renewed interest. In particular, the SimCLR paper, which presents a simple framework for contrastive learning of visual representations, has gained a lot of attention in the self-supervised and contrastive learning space.

The central idea behind the paper is very simple: allow the model to learn whether a pair of images was derived from the same or from different initial images.

Figure 1: The high-level idea behind SimCLR. Source: SimCLR paper

The SimCLR approach encodes each input image i as a feature vector zi. There are two cases to consider:

Positive pairs: The same image is augmented using two different sets of augmentations, and the resulting feature vectors zi and zj are compared. These feature vectors are forced to be similar by the loss function.

Negative pairs: Different images are augmented using different sets of augmentations, and the resulting feature vectors zi and zk are compared. These feature vectors are forced to be dissimilar by the loss function.

The rest of this article will focus on explaining and understanding this loss function and its efficient implementation in PyTorch.

The NT-Xent Loss

At a high level, the contrastive learning model is fed 2N images, originating from N underlying images. Each of the N underlying images is augmented using a random set of image augmentations to produce 2 augmented images. This is how we end up with 2N images in a single training batch fed to the model.

Figure 2: A batch of 6 images in a single training batch for contrastive learning. The number below each image is the index of that image in the input batch when fed into a contrastive learning model. Image source: Oxford Visual Geometry Group (CC-SA).

In the following sections, we will dive deep into the following aspects of the NT-Xent loss:

1. The effect of temperature on SoftMax and Sigmoid
2. A simple and intuitive interpretation of the NT-Xent loss
3. A step-by-step implementation of NT-Xent in PyTorch
4. Motivating the need for a multi-label loss function (NT-BXent)
5. A step-by-step implementation of NT-BXent in PyTorch

All the code for steps 2-5 can be found in this notebook. The code for step 1 can be found in this notebook.

The effect of temperature on SoftMax and Sigmoid

To understand all the moving parts of the contrastive loss function we'll be studying in this article, we first need to understand the effect of temperature on the SoftMax and Sigmoid activation functions.

Typically, temperature scaling is applied to the input of SoftMax or Sigmoid to either smooth out or accentuate the output of those activation functions: the input logits are divided by the temperature before being passed into the activation functions. You can find all the code for this section in this notebook.

SoftMax: For SoftMax, a high temperature reduces the variance in the output distribution, which results in softening of the labels. A low temperature increases the variance in the output distribution and makes the maximum value stand out over the other values. See the charts below for the effect of temperature on SoftMax when fed the input tensor [0.1081, 0.4376, 0.7697, 0.1929, 0.3626, 2.8451].

Figure 3: Effect of temperature on SoftMax. Source: Author(s)
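To make the effect concrete, here is a minimal sketch of ours (not from the article's notebook) that prints the SoftMax output for the tensor above at a few temperatures; the temperature values chosen here are purely illustrative.

import torch
import torch.nn.functional as F

# The example logits from the charts above.
logits = torch.tensor([0.1081, 0.4376, 0.7697, 0.1929, 0.3626, 2.8451])

# Divide the logits by the temperature before applying SoftMax.
for t in (0.1, 1.0, 10.0):
    print(f"t={t:5.1f}: {F.softmax(logits / t, dim=0)}")

# A low temperature (0.1) makes the largest logit dominate the distribution,
# while a high temperature (10.0) pushes the outputs towards uniform.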
Sigmoid: For Sigmoid, a high temperature results in an output distribution that is pulled towards 0.5, whereas a low temperature stretches the inputs to larger magnitudes, pushing the outputs closer to either 0.0 or 1.0 depending on the sign of the input.

Figure 4: Effect of temperature on Sigmoid. Source: Author(s)

Now that we understand the effect of various temperature values on the SoftMax and Sigmoid functions, let's see how this applies to our understanding of the NT-Xent loss.

Interpreting the NT-Xent loss

The NT-Xent loss is understood by understanding the individual terms in the name of this loss:

Normalized: Cosine similarity produces a normalized score in the range [-1.0, 1.0].
Temperature-scaled: The all-pairs cosine similarity is scaled by a temperature before computing the cross-entropy loss.
Cross-entropy loss: The underlying loss is a multi-class (single-label) cross-entropy loss.

As mentioned above, we assume that for a batch of size 2N, the feature vectors at the following indices represent positive pairs: (0, 1), (2, 3), (4, 5), (6, 7), and so on, and that the rest of the combinations represent negative pairs. This is an important factor to keep in mind throughout the interpretation of the NT-Xent loss as it relates to SimCLR.

Now that we understand what the terms mean in the context of the NT-Xent loss, let's take a look at the mechanical steps needed to compute the NT-Xent loss on a batch of feature vectors:

1. The all-pairs cosine similarity score is computed for each of the 2N vectors produced by the SimCLR model. This results in (2N)² similarity scores represented as a 2N x 2N matrix.
2. Comparison results between the same value (i, i) are discarded (since a distribution is perfectly similar to itself and can't possibly allow the model to learn anything useful).
3. Each value (cosine similarity) is scaled by a temperature parameter τ (which is a hyper-parameter).
4. Cross-entropy loss is applied to each row of the resulting matrix. The following paragraph explains this in more detail.
5. Typically, the mean of these losses (one loss per element in the batch) is used for backpropagation.

The way that the cross-entropy loss is used here is semantically slightly different from how it's used in standard classification tasks. In classification tasks, a final "classification head" is trained to produce a one-hot probability vector for each input, and we compute the cross-entropy loss on that one-hot probability vector, since we're effectively computing the difference between two distributions. This video explains the concept of cross-entropy loss beautifully. In the NT-Xent loss, there isn't a 1:1 correspondence between a trainable layer and the output distribution. Instead, a feature vector is computed for each input, and we then compute the cosine similarity between every pair of feature vectors. The trick here is that, since each image is similar to exactly one other image in the input batch (its positive pair), if we ignore the similarity of a feature vector with itself, we can consider this to be a classification-like setting: the distribution of similarity probabilities between images represents a classification task where one of them will be close to 1.0 and the rest will be close to 0.0.

Now that we have a solid overall understanding of the NT-Xent loss, we should be in great shape to implement these ideas in PyTorch.
Let's get going!

Implementation of the NT-Xent loss in PyTorch

All the code in this section can be found in this notebook.

Code reuse: Many implementations of the NT-Xent loss seen online implement all the operations from scratch. Furthermore, some of them implement the loss function inefficiently, preferring for loops over GPU parallelism. Instead, we will use a different approach: we'll implement this loss in terms of the standard cross-entropy loss that PyTorch already provides. To do this, we need to massage the predictions and ground-truth labels into a format that cross_entropy can accept. Let's see how to do this below.

Predictions tensor: First, we need to create a PyTorch tensor that will represent the output from our contrastive learning model. Let's assume that our batch size is 8 (2N=8) and that our feature vectors have 2 dimensions (2 values). We'll call our input variable "x".

x = torch.randn(8, 2)

Cosine similarity: Next, we'll compute the all-pairs cosine similarity between every feature vector in this batch and store the result in the variable named "xcs". If the line below seems confusing, please read the details on this page. This is the "normalize" step.

xcs = F.cosine_similarity(x[None, :, :], x[:, None, :], dim=-1)

As mentioned above, we need to ignore the self-similarity score of every feature vector, since it doesn't contribute to the model's learning and will be an unnecessary nuisance later on when we want to compute the cross-entropy loss. For this purpose, we'll define a variable "eye", a matrix in which the elements on the principal diagonal have the value 1.0 and the rest are 0.0. We can create such a matrix using the following command.

eye = torch.eye(8)

Now let's convert this into a boolean matrix so that we can index into the "xcs" variable using this mask matrix.

eye = eye.bool()

Let's clone the tensor "xcs" into a tensor named "y" so that we can reference the "xcs" tensor later.

y = xcs.clone()

Now, we will set the values along the principal diagonal of the all-pairs cosine similarity matrix to -inf, so that when we compute the softmax on each row, this value will contribute nothing.

y[eye] = float("-inf")

The tensor "y", scaled by a temperature parameter, will be one of the inputs (the predictions) to the cross-entropy loss API in PyTorch. Next, we need to compute the ground-truth labels (the target) that we feed to the cross-entropy loss API.

Ground-truth labels (target tensor): For the example we are using (2N=8), this is what the ground-truth tensor should look like.

tensor([1, 0, 3, 2, 5, 4, 7, 6])

That's because the following index pairs in the tensor "y" contain positive pairs:

(0, 1), (1, 0)
(2, 3), (3, 2)
(4, 5), (5, 4)
(6, 7), (7, 6)

To interpret the index pairs above, we look at a single example. The pair (4, 5) means that column 5 at row 4 is supposed to be set to 1.0 (a positive pair), which is what the tensor above is also saying. Great!

To create the tensor above, we can use the following PyTorch code, which stores the ground-truth labels in the variable "target".

target = torch.arange(8)
target[0::2] += 1
target[1::2] -= 1

Cross-entropy loss: We have all the ingredients we need to compute our loss! The only thing that remains to be done is to call the cross_entropy API in PyTorch.

loss = F.cross_entropy(y / temperature, target, reduction="mean")

The variable "loss" now contains the computed NT-Xent loss.
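Before wrapping this up into a function, a quick way to convince ourselves the recipe behaves sensibly is to feed it a batch in which each positive pair is literally the same vector; the loss should then be small. This is a minimal sketch of ours, not from the original notebook; the seed, the 128-dimensional features and the temperature of 0.1 are arbitrary choices for illustration.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
temperature = 0.1

# Build a batch of 8 vectors where indices (0,1), (2,3), ... are identical,
# i.e. every positive pair has a cosine similarity of exactly 1.0. Random
# 128-dimensional negatives are close to orthogonal to each other.
x = torch.randn(4, 128).repeat_interleave(2, dim=0)

xcs = F.cosine_similarity(x[None, :, :], x[:, None, :], dim=-1)
y = xcs.clone()
y[torch.eye(8).bool()] = float("-inf")

target = torch.arange(8)
target[0::2] += 1
target[1::2] -= 1

# With perfect positives the positive logit dominates each row, so the
# cross-entropy (and hence the NT-Xent loss) should come out close to 0.
print(F.cross_entropy(y / temperature, target, reduction="mean"))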
Let's wrap all the code into a single Python function below.

def nt_xent_loss(x, temperature):
    assert len(x.size()) == 2

    # Cosine similarity
    xcs = F.cosine_similarity(x[None, :, :], x[:, None, :], dim=-1)
    xcs[torch.eye(x.size(0)).bool()] = float("-inf")

    # Ground truth labels
    target = torch.arange(x.size(0))
    target[0::2] += 1
    target[1::2] -= 1

    # Standard cross-entropy loss
    return F.cross_entropy(xcs / temperature, target, reduction="mean")

The code above works as long as each feature vector has exactly one positive pair in the batch when training our contrastive learning model. Let's take a look at how to handle multiple positive pairs in a contrastive learning task.

A multi-label loss for contrastive learning: NT-BXent

In the SimCLR paper, every image i has exactly one similar pair at index j. This makes cross-entropy loss a perfect choice for the task, since it resembles a multi-class problem. Instead, if we feed M > 2 augmentations of the same image into the contrastive learning model's single training batch, then each batch would have M-1 similar pairs for image i. This task would resemble a multi-label problem.

The obvious choice is to replace cross-entropy loss with binary cross-entropy loss. Hence the name NT-BXent loss, which stands for Normalized Temperature-scaled Binary cross-entropy loss.

The formulation below shows the loss Li for the element i. The σ in the formula stands for the Sigmoid function.

Figure 5: Formulation for the NT-BXent loss. Image source: Author(s) of this article

To avoid the class-imbalance problem, we weigh the positive and negative pairs by the inverse of the number of positive and negative pairs in our mini-batch. The final loss in the mini-batch used for backpropagation will be the mean of the losses of each sample in our mini-batch.
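Since Figure 5 is an image that is not reproduced here, the following is our reconstruction of the formulation in LaTeX, written to match the weighting description above and the implementation below; treat it as our reading of the figure rather than a verbatim copy. Here P(i) and N(i) denote the sets of positive and negative pair indices for element i, s_ij the cosine similarity between elements i and j, σ the Sigmoid function, and τ the temperature.

L_i = -\frac{1}{|P(i)|} \sum_{j \in P(i)} \log \sigma\left(\frac{s_{ij}}{\tau}\right)
      - \frac{1}{|N(i)|} \sum_{k \in N(i)} \log\left(1 - \sigma\left(\frac{s_{ik}}{\tau}\right)\right)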
Next, let's focus our attention on our implementation of the NT-BXent loss in PyTorch.

Implementation of the NT-BXent loss in PyTorch

All the code in this section can be found in this notebook.

Code reuse: Similar to our implementation of the NT-Xent loss, we shall reuse the binary cross-entropy (BCE) loss method provided by PyTorch. The setup of our ground-truth labels will be similar to that of a multi-label classification problem where BCE loss is used.

Predictions tensor: We'll use the same (8, 2) predictions tensor as we used for the implementation of the NT-Xent loss.

x = torch.randn(8, 2)

Cosine similarity: Since the input tensor x is the same, the all-pairs cosine similarity tensor xcs will also be the same. Please see this page for a detailed explanation of what the line below does.

xcs = F.cosine_similarity(x[None, :, :], x[:, None, :], dim=-1)

To ensure that the loss from the element at position (i, i) is 0, we'll need to perform some gymnastics so that the xcs tensor contains a value of 1 at every index (i, i) after Sigmoid is applied to it. Since we'll be using BCE loss, we will mark the self-similarity score of every feature vector with the value infinity in the tensor xcs. That's because applying the sigmoid function to the xcs tensor will convert infinity to the value 1, and we will set up our ground-truth labels so that every position (i, i) in the ground-truth labels has the value 1.

Let's create a masking tensor that has the value True along the principal diagonal (xcs has self-similarity scores along the principal diagonal) and False everywhere else.

eye = torch.eye(8).bool()

Let's clone the tensor "xcs" into a tensor named "y" so that we can reference the "xcs" tensor later.

y = xcs.clone()

Now, we will set the values along the principal diagonal of the all-pairs cosine similarity matrix to infinity, so that when we compute the sigmoid on each row, we get 1 in these positions.

y[eye] = float("inf")

The tensor "y", scaled by a temperature parameter, will be one of the inputs (the predictions) to the BCE loss API in PyTorch. Next, we need to compute the ground-truth labels (the target) that we feed to the BCE loss API.

Ground-truth labels (target tensor): We will expect the user to pass us all the (x, y) index pairs that contain positive examples. This is a departure from what we did for the NT-Xent loss: there the positive pairs were implicit, whereas here the positive pairs are explicit.

In addition to the locations provided by the user, we will set all the diagonal elements as positive pairs, as explained above. We will use the PyTorch tensor indexing API to pluck out all the elements at those locations and set them to 1, whereas the rest are initialized to 0.

target = torch.zeros(8, 8)
pos_indices = torch.tensor([
    (0, 0), (0, 2), (0, 4),
    (1, 4), (1, 6), (1, 1),
    (2, 3),
    (3, 7),
    (4, 3),
    (7, 6),
])

# Add indexes of the principal diagonal as positive indexes.
# This will be useful since we will use the BCE loss in PyTorch,
# which will expect a value for the elements on the principal
# diagonal as well.
pos_indices = torch.cat([
    pos_indices,
    torch.arange(8).reshape(8, 1).expand(-1, 2),
], dim=0)

# Set the values in the target vector to 1.
target[pos_indices[:, 0], pos_indices[:, 1]] = 1

Binary cross-entropy (BCE) loss: Unlike the NT-Xent loss, we can't simply call the torch.nn.functional.binary_cross_entropy function, since we want to weigh the positive and negative losses based on how many positive and negative pairs the element at index i has in the current mini-batch.

The first step, though, is to compute the element-wise BCE loss.

temperature = 0.1
loss = F.binary_cross_entropy((y / temperature).sigmoid(), target, reduction="none")

We'll create a binary mask of positive and negative pairs and then create two tensors, loss_pos and loss_neg, that contain only those elements of the computed loss that correspond to the positive and negative pairs respectively.

target_pos = target.bool()
target_neg = ~target_pos

# loss_pos and loss_neg below contain non-zero values only for those elements
# that are positive pairs and negative pairs respectively.
loss_pos = torch.zeros(x.size(0), x.size(0)).masked_scatter(target_pos, loss[target_pos])
loss_neg = torch.zeros(x.size(0), x.size(0)).masked_scatter(target_neg, loss[target_neg])

Next, we'll sum up the positive and negative pair losses (separately) corresponding to each element i in our mini-batch.

# loss_pos and loss_neg now contain the sum of positive and negative pair losses
# as computed relative to the i'th input.
loss_pos = loss_pos.sum(dim=1)
loss_neg = loss_neg.sum(dim=1)

To perform weighting, we need to track the number of positive and negative pairs corresponding to each element i in our mini-batch.
The tensors "num_pos" and "num_neg" will store these values.

# num_pos and num_neg below contain the number of positive and negative pairs
# computed relative to the i'th input. In an actual setting, this number should
# be the same for every input element, but we let it vary here for maximum
# flexibility.
num_pos = target.sum(dim=1)
num_neg = target.size(0) - num_pos

We have all the ingredients we need to compute our loss! The only thing left to do is weigh the positive and negative losses by the number of positive and negative pairs, and then average the loss across the mini-batch.

def nt_bxent_loss(x, pos_indices, temperature):
    assert len(x.size()) == 2

    # Add indexes of the principal diagonal elements to pos_indices
    pos_indices = torch.cat([
        pos_indices,
        torch.arange(x.size(0)).reshape(x.size(0), 1).expand(-1, 2),
    ], dim=0)

    # Ground truth labels
    target = torch.zeros(x.size(0), x.size(0))
    target[pos_indices[:, 0], pos_indices[:, 1]] = 1.0

    # Cosine similarity
    xcs = F.cosine_similarity(x[None, :, :], x[:, None, :], dim=-1)
    # Set logit of diagonal element to "inf" signifying complete
    # correlation. sigmoid(inf) = 1.0 so this will work out nicely
    # when computing the Binary cross-entropy Loss.
    xcs[torch.eye(x.size(0)).bool()] = float("inf")

    # Standard binary cross-entropy loss. We use binary_cross_entropy() here
    # and not binary_cross_entropy_with_logits() because of
    # https://github.com/pytorch/pytorch/issues/102894
    # The method *_with_logits() uses the log-sum-exp trick, which causes
    # inf and -inf values to result in a NaN result.
    loss = F.binary_cross_entropy((xcs / temperature).sigmoid(), target, reduction="none")

    target_pos = target.bool()
    target_neg = ~target_pos

    loss_pos = torch.zeros(x.size(0), x.size(0)).masked_scatter(target_pos, loss[target_pos])
    loss_neg = torch.zeros(x.size(0), x.size(0)).masked_scatter(target_neg, loss[target_neg])
    loss_pos = loss_pos.sum(dim=1)
    loss_neg = loss_neg.sum(dim=1)
    num_pos = target.sum(dim=1)
    num_neg = x.size(0) - num_pos

    return ((loss_pos / num_pos) + (loss_neg / num_neg)).mean()

pos_indices = torch.tensor([
    (0, 0), (0, 2), (0, 4),
    (1, 4), (1, 6), (1, 1),
    (2, 3),
    (3, 7),
    (4, 3),
    (7, 6),
])
for t in (0.01, 0.1, 1.0, 10.0, 20.0):
    print(f"Temperature: {t:5.2f}, Loss: {nt_bxent_loss(x, pos_indices, temperature=t)}")

This prints:

Temperature: 0.01, Loss: 62.898780822753906
Temperature: 0.10, Loss: 4.851151943206787
Temperature: 1.00, Loss: 1.0727109909057617
Temperature: 10.00, Loss: 0.9827173948287964
Temperature: 20.00, Loss: 0.982099175453186

Conclusion

Self-supervised learning is an up-and-coming field in deep learning that allows one to train models on unlabeled data. This technique lets us work around the requirement for labeled data at scale.

In this article, we learned about loss functions for contrastive learning. The first one, the NT-Xent loss, is used for learning on a single positive pair per input in a mini-batch. We then introduced the NT-BXent loss, which is used for learning on multiple (> 1) positive pairs per input in a mini-batch. We learned to interpret both intuitively, building on our knowledge of cross-entropy loss and binary cross-entropy loss. Finally, we implemented them both efficiently in PyTorch.

NT-Xent (Normalized Temperature-Scaled Cross-Entropy) Loss Explained and Implemented in PyTorch was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


Anomaly Detection Using Sigma Rules: Build Your Own Spark Streaming Detections

Easily deploy Sigma rules in Spark streaming pipelines: a future-proof solution supporting the upcoming Sigma 2 specification

Photo by Dana Walker on Unsplash

In our previous articles we elaborated on and designed a stateful function named flux-capacitor.

The flux-capacitor stateful function can remember parent-child (and ancestor) relationships between log events. It can also remember events occurring on the same host in a certain window of time; the Sigma specification refers to this as temporal proximity correlation.

For a deep dive into the design of the flux-capacitor, refer to part 1, part 2, part 3, part 4, and part 5. However, you don't need to understand the implementation of the function to use it.

In this article we first show a Spark streaming job that performs discrete detections. A discrete detection is a Sigma rule that uses the features and values of a single log line (a single event).

Then we leverage the flux-capacitor function to handle stateful parent-child relationships between log events. The flux-capacitor is also able to detect a number of events occurring on the same host in a certain window of time; these are called temporal proximity correlations in the upcoming Sigma specification. A complete demo of these Spark streaming jobs is available in our git repo.

Discrete Detections

Performing discrete tests is fairly straightforward, thanks to all the built-in functions that come out of the box in Spark. Spark has support for reading streaming sources, writing to sinks, checkpointing, stream-stream joins, windowed aggregations and many more. For a complete list of the possible functionalities, see the comprehensive Spark Structured Streaming Programming Guide.

Here's a high-level diagram showing a Spark streaming job that consumes events from an Iceberg table of "start-process" Windows events (1). A classic example of this is found in Windows Security Logs (Event ID 4688).

Topology for discrete detections

The source table (1) is named process_telemetry_table. The Spark job reads all events, detects anomalous events, tags these events and writes them to table (3), named tagged_telemetry_table. Events deemed anomalous are also written to a table (4) containing alerts.

Periodically, we poll a git repository (5) containing the SQL auto-generated from the Sigma rules we want to apply. If the SQL statements change, we restart the streaming job to add these new detections to the pipeline.

Let's take this Sigma rule as an example:

screenshot from proc_creation_win_rundll32_sys.yml at Sigma HQ

The detection section is the heart of the Sigma rule and consists of a condition and one or more named tests. Here, selection1 and selection2 are named boolean tests. The author of the Sigma rule can give meaningful names to these tests. The condition is where the user can combine the tests in a final evaluation. See the Sigma specification for more details on writing a Sigma rule. From now on we will refer to these named boolean tests as tags.
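Since the screenshot of the rule is not reproduced here, the following is an abbreviated sketch of what the detection section of proc_creation_win_rundll32_sys.yml looks like, reconstructed from the auto-generated SQL shown further below; it is not a verbatim copy of the Sigma HQ rule.

detection:
    selection1:
        CommandLine|contains: 'rundll32.exe'
    selection2:
        CommandLine|contains:
            - '.sys,'
            - '.sys '
    condition: selection1 and selection2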
The inner workings of the Spark streaming job are broken down into 4 logical steps:

1. read the source table process_telemetry_table
2. perform pattern matching
3. evaluate the final condition
4. write the results

The Pattern Match step consists of evaluating the tags found in the Sigma rule, and the Eval final condition step evaluates the condition.

On the right of this diagram we show what a row looks like at each stage of processing. The columns in blue represent values read from the source table. The Pattern Match step adds a column named Sigma tags, a map of all the tests performed and whether each test passed or failed. The gray column contains the final Sigma rule evaluations. Finally, the brown columns are added in the foreachBatch function: a GUID is generated, the rule names that are true are extracted from the Sigma tags map, and the detection action is retrieved from a lookup map of rule name to rule type. This gives context to the alerts produced.

This diagram depicts how attributes of the event are combined into tags, the final evaluation and, finally, contextual information.

Let's now look at the actual PySpark code. First, we connect Spark to the source table using the readStream function, specifying the name of the Iceberg table to read from. The load function returns a dataframe, which we use to create a view named process_telemetry_view.

spark
    .readStream
    .format("iceberg")
    .option("stream-from-timestamp", ts)
    .option("streaming-skip-delete-snapshots", True)
    .option("streaming-skip-overwrite-snapshots", True)
    .load(constants.process_telemetry_table)
    .createOrReplaceTempView("process_telemetry_view")

The data in process_telemetry_view looks like this:

+-------------------+---+---------+---------------------+
|timestamp          |id |parent_id|Commandline          |
+-------------------+---+---------+---------------------+
|2022-12-25 00:00:01|11 |0        |                     |
|2022-12-25 00:00:02|2  |0        |c:\win\notepad.exe   |
|2022-12-25 00:00:03|12 |11       |                     |
|2022-12-25 00:00:08|201|200      |cmdline and args     |
|2022-12-25 00:00:09|202|201      |                     |
|2022-12-25 00:00:10|203|202      |c:\test.exe          |
+-------------------+---+---------+---------------------+

On this view we apply a Pattern Match step, which consists of an auto-generated SQL statement produced by the Sigma compiler. The pattern_match.sql file looks like this:

select
    *,
    -- regroup each rule's tags in a map (ruleName -> tags)
    map(
        'rule0',
        map(
            'selection1', (CommandLine LIKE '%rundll32.exe%'),
            'selection2', (CommandLine LIKE '%.sys,%' OR CommandLine LIKE '%.sys %')
        )
    ) as sigma
from process_telemetry_view

We use spark.sql() to apply this statement to the process_telemetry_view view.

df = spark.sql(render_file("pattern_match.sql"))
df.createOrReplaceTempView("pattern_match_view")

Notice that the results of each tag found in the Sigma rule are stored in a map of boolean values. The sigma column holds the results of each tag found in each Sigma rule. By using a MapType we can easily introduce new Sigma rules without affecting the schema of the table: adding a new rule simply adds a new entry in the sigma column (a MapType).

+---+---------+---------------------+-----------------------------------------------------------+
|id |parent_id|Commandline          |sigma                                                      |
+---+---------+---------------------+-----------------------------------------------------------+
|11 |0        |                     |{rule0 -> {selection1 -> false, selection2 -> false}, ...}|

Similarly, the Eval final condition step applies the conditions from the Sigma rules. The conditions are compiled into an SQL statement which uses map, map_filter and map_keys to build a column named sigma_final.
This column holds the names of all the rules whose condition evaluates to true.

select
    *,
    map_keys( -- only keep the rule names of rules that evaluated to true
        map_filter( -- filter map entries, keeping only rules that evaluated to true
            map( -- store the result of the condition of each rule in a map
                'rule0', -- rule 0 -> condition: all of selection*
                sigma.rule0.selection1 AND sigma.rule0.selection2
            ),
            (k, v) -> v = TRUE
        )
    ) as sigma_final
from pattern_match_view

The auto-generated statement is applied using spark.sql().

df = spark.sql(render_file("eval_final_condition.sql"))

Here are the results with the newly added sigma_final column, an array of the rules that fire.

+---+---------+-------------------------------------+-------------+
|id |parent_id|sigma                                | sigma_final |
+---+---------+-------------------------------------+-------------+
|11 |0        |{rule0 -> {selection1 -> false,      | []          |
|   |         |           selection2 -> false}}     |             |

We are now ready to start the streaming job for our dataframe. Notice that we pass a callback function, foreach_batch_function, to foreachBatch.

streaming_query = (
    df
    .writeStream
    .queryName("detections")
    .trigger(processingTime=f"{trigger} seconds")
    .option("checkpointLocation", get_checkpoint_location(constants.tagged_telemetry_table))
    .foreachBatch(foreach_batch_function)
    .start()
)
streaming_query.awaitTermination()

The foreach_batch_function is called at every micro-batch and is given the evaluated batchdf dataframe. It writes the entirety of batchdf to the tagged_telemetry_table and also writes alerts for any of the Sigma rules that evaluated to true.

def foreach_batch_function(batchdf, epoch_id):
    # Transform and write batchdf
    batchdf.persist()
    batchdf.createOrReplaceGlobalTempView("eval_condition_view")
    run("insert_into_tagged_telemetry")
    run("publish_suspected_anomalies")
    spark.catalog.clearCache()

The details of insert_into_tagged_telemetry.sql and publish_suspected_anomalies.sql can be found in our git repo.

As mentioned above, writing a streaming anomaly detection handling discrete tests is relatively straightforward using the built-in functionality found in Spark.

Detections Based on Past Events

Thus far we have shown how to detect events with discrete Sigma rules. In this section we leverage the flux-capacitor function to enable caching tags and testing tags of past events. As discussed in our previous articles, the flux-capacitor lets us detect parent-child relationships and also sequences of arbitrary features of past events.

These types of Sigma rules need to simultaneously consider the tags of the current event and of past events. In order to perform the final rule evaluation, we introduce a Time travel tags step to retrieve all of the past tags for an event and merge them with the current event. This is what the flux-capacitor function is designed to do: it caches and retrieves past tags. Once past tags and current tags are on the same row, the Eval final condition step can be evaluated just like in our discrete example above.

The detection now looks like this:

The flux-capacitor is given the Sigma tags produced by the Pattern Match step and stores these tags for later retrieval. The column in red has the same schema as the Sigma tags column we used before; however, it combines current and past tags, which the flux-capacitor retrieved from its internal state.

Adding caching and retrieval of past tags is easy thanks to the flux-capacitor function. Here's how we applied the flux-capacitor function in our Spark anomaly detection.
First, pass the dataframe produced by the Pattern Match step to the flux_stateful_function; the function returns another dataframe, which contains the past tags.

flux_update_spec = read_flux_update_spec()
bloom_capacity = 200000
# reference the scala code
flux_stateful_function = spark._sc._jvm.cccs.fluxcapacitor.FluxCapacitor.invoke
# group logs by host_id
jdf = flux_stateful_function(
    pattern_match_df._jdf,
    "host_id",
    bloom_capacity,
    flux_update_spec)
output_df = DataFrame(jdf, spark)

To control the behavior of the flux_stateful_function, we pass in a flux_update_spec. The flux-capacitor specification is a YAML file produced by the Sigma compiler. The specification details which tags should be cached and retrieved and how they should be handled. The action attribute can be set to parent, ancestor or temporal.

Let's use a concrete example from Sigma HQ: proc_creation_win_rundll32_executable_invalid_extension.yml

screenshot from Sigma HQ github

Again, the heart of the detection consists of tags and of a final condition that puts all these tags together. Note, however, that this rule (which we will refer to as Rule 1) involves tests against CommandLine and also tests on the parent process's ParentImage. ParentImage is not a field found in the start-process logs; rather, it refers to the Image field of the parent process.

As seen before, this Sigma rule will be compiled into SQL to evaluate the tags and to combine them into a final condition.

In order to propagate the parent tags, the Sigma compiler also produces a flux-capacitor specification. Rule 1 is a parent rule, and thus the specification must state what the parent and child fields are; in our logs these correspond to id and parent_id. The specification also states which tags should be cached and retrieved by the flux-capacitor function. Here is the auto-generated specification:

rules:
  - rulename: rule1
    description: proc_creation_win_run_executable_invalid_extension
    action: parent
    tags:
      - name: filter_iexplorer
      - name: filter_edge_update
      - name: filter_msiexec_system32
    parent: parent_id
    child: id

Note: Rule 0 is not included in the flux-capacitor specification since it has no temporal tags.

Illustrating Tag Propagation

In order to better understand what the flux-capacitor does, you can use the function outside a streaming analytic. Here we show a simple ancestor example. We want to propagate the tag pf. For example, pf might represent a CommandLine containing rundll32.exe.

spec = """
rules:
  - rulename: rule2
    action: ancestor
    child: pid
    parent: parent_pid
    tags:
      - name: pf
"""

df_input = spark.sql("""
    select *
    from values
        (TIMESTAMP '2022-12-30 00:00:05', 'host1', 'pid500', '',       map('rule2', map('pf', true,  'cf', false))),
        (TIMESTAMP '2022-12-30 00:00:06', 'host1', 'pid600', 'pid500', map('rule2', map('pf', false, 'cf', false))),
        (TIMESTAMP '2022-12-30 00:00:07', 'host1', 'pid700', 'pid600', map('rule2', map('pf', false, 'cf', true)))
        t(timestamp, host_id, pid, parent_pid, sigma)
""")

Printing the dataframe df_input, we see that pid500 started with a CommandLine that has the pf feature. Then pid500 started pid600. Later, pid600 started pid700.
Pid700 had a child feature cf.

+-------------------+------+----------+--------------+-------------------------------------+
|timestamp          |pid   |parent_pid|human_readable|sigma                                |
+-------------------+------+----------+--------------+-------------------------------------+
|2022-12-30 00:00:05|pid500|          |[pf]          |{rule2 -> {pf -> true, cf -> false}} |
|2022-12-30 00:00:06|pid600|pid500    |[]            |{rule2 -> {pf -> false, cf -> false}}|
|2022-12-30 00:00:07|pid700|pid600    |[cf]          |{rule2 -> {pf -> false, cf -> true}} |
+-------------------+------+----------+--------------+-------------------------------------+

The Sigma rule is a combination of both pf and cf. In order to bring the pf tag back onto the current row, we need to apply time travel to the pf tag. Applying the flux-capacitor function to the df_input dataframe:

jdf = flux_stateful_function(df_input._jdf, "host_id", bloom_capacity, spec, True)
df_output = DataFrame(jdf, spark)

We obtain the df_output dataframe. Notice how the pf tag is propagated through time.

+-------------------+------+----------+--------------+------------------------------------+
|timestamp          |pid   |parent_pid|human_readable|sigma                               |
+-------------------+------+----------+--------------+------------------------------------+
|2022-12-30 00:00:05|pid500|          |[pf]          |{rule2 -> {pf -> true, cf -> false}}|
|2022-12-30 00:00:06|pid600|pid500    |[pf]          |{rule2 -> {pf -> true, cf -> false}}|
|2022-12-30 00:00:07|pid700|pid600    |[pf, cf]      |{rule2 -> {pf -> true, cf -> true}} |
+-------------------+------+----------+--------------+------------------------------------+

The notebook TagPropagationIllustration.ipynb contains more examples like this for parent-child and temporal proximity.

Building Alerts with Context

The flux-capacitor function caches all the past tags. In order to conserve memory, it caches these tags using bloom filter segments. Bloom filters have an extremely small memory footprint and are quick to query and update. However, they do introduce possible false positives. It is thus possible that one of our detections is in fact a false positive. In order to remedy this, we put the suspected anomalies in a queue (4) for re-evaluation.
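To see why a Bloom filter can claim membership for a tag that was never stored, here is a small self-contained sketch of ours (not from the flux-capacitor code); the TinyBloom class, its deliberately tiny bit array and the tag strings are all hypothetical, chosen to make false positives easy to observe.

import hashlib

class TinyBloom:
    """A deliberately small Bloom filter so that false positives are likely."""
    def __init__(self, num_bits=32, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions from a single digest.
        digest = hashlib.sha256(item.encode()).digest()
        return [digest[i] % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # True can be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))

bloom = TinyBloom()
for tag in (f"host1/pf/{i}" for i in range(20)):
    bloom.add(tag)

# Tags that were never added can still report True once enough bits are set,
# which is why suspected anomalies are re-evaluated against exact state.
fp = sum(bloom.might_contain(f"host9/pf/{i}") for i in range(1000))
print(f"false positives: {fp}/1000")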
To eliminate false positives, a second Spark streaming job, named the Alert Builder, reads the suspected anomalies (5) and retrieves the events (6) that are required to re-evaluate the rule.

For example, in the case of a parent-child Sigma rule, the Alert Builder reads the suspected anomaly (5), retrieving a child process event. Next, in (6), it retrieves the parent process of this child event. Then, using these two events, it re-evaluates the Sigma rule. This time, however, the flux-capacitor is configured to store tags in a hash map rather than in bloom filters. This eliminates false positives, and as a bonus we have all the events involved in this detection. We store this alert, along with the rows of evidence (the parent and child events), in an alert table (7).

Topology with stateful detections (temporal)

The Alert Builder handles a fraction of the volume processed by (2), the Streaming Detections job. Thanks to the low volume read in (5), historical searches into the tagged telemetry (6) are possible.

For a more in-depth look, take a look at the Spark jobs for the Streaming Detections (streaming_detections.py) and the Alert Builder (streaming_alert_builder.py).

Performance

To evaluate the performance of this proof of concept, we ran tests on machines with 16 CPUs and 64G of RAM. We wrote a simple data producer that creates 5,000 synthetic events per second and ran the experiment for 30 days.

The Spark Streaming Detections job runs on one machine. The job is configured to trigger every minute. Each micro-batch (trigger) reads 300,000 events and takes on average 20 seconds to complete. The job can easily keep up with the incoming event rate.

Spark Streaming Detections

The Spark Alert Builder also runs on a single machine and is configured to trigger every minute. This job takes between 30 and 50 seconds to complete and is very sensitive to the organization of the tagged_telemetry_table. Here we see the effect of the maintenance job, which organizes and sorts the latest data every hour: at every hour, the Spark Alert Builder's micro-batch execution time drops back to 30 seconds.

Spark Streaming Alert Builder

Table Maintenance

Our Spark streaming jobs trigger every minute and thus produce small data files every minute. In order to allow for fast searches and retrieval in this table, it's important to compact and sort the data periodically. Fortunately, Iceberg comes with built-in procedures to organize and maintain your tables.

For example, this script, maintenance.py, runs every hour to sort and compact the newly added files of the Iceberg tagged_telemetry_table.

CALL catalog.system.rewrite_data_files(
    table => 'catalog.jc_sched.tagged_telemetry_table',
    strategy => 'sort',
    sort_order => 'host_id, has_temporal_proximity_tags',
    options => map('min-input-files', '100',
                   'max-concurrent-file-group-rewrites', '30',
                   'partial-progress.enabled', 'true'),
    where => 'timestamp >= TIMESTAMP \'2023-05-06 00:00:00\' '
)

At the end of the day we also re-sort this table, yielding maximum search performance over long search periods (months of data).

CALL catalog.system.rewrite_data_files(
    table => 'catalog.jc_sched.tagged_telemetry_table',
    strategy => 'sort',
    sort_order => 'host_id, has_temporal_proximity_tags',
    options => map('min-input-files', '100',
                   'max-concurrent-file-group-rewrites', '30',
                   'partial-progress.enabled', 'true',
                   'rewrite-all', 'true'),
    where => 'timestamp >= TIMESTAMP \'2023-05-05 00:00:00\' AND timestamp < TIMESTAMP \'2023-05-06 00:00:00\' '
)

Another maintenance task we perform is deleting old data from the streaming tables. These tables are only used as buffers between producers and consumers, so every day we age off the streaming tables, keeping 7 days of data.

delete from catalog.jc_sched.process_telemetry_table
where timestamp < current_timestamp() - interval 7 days

Finally, every day we perform standard Iceberg table maintenance tasks, like expiring snapshots and removing orphan files. We run these maintenance jobs on all of our tables and schedule them on Airflow.

Conclusion

In this article we showed how to build a Spark streaming anomaly detection framework that generically applies Sigma rules. New Sigma rules can easily be added to the system.

This proof of concept was extensively tested on synthetic data to evaluate its stability and scalability. It shows great promise, and further evaluation will be performed on a production system.

All images unless otherwise noted are by the author.

Anomaly Detection Using Sigma Rules: Build Your Own Spark Streaming Detections was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


Decoding the US Senate Hearing on Oversight of AI: NLP Analysis in Python

Photo by Harold Mendoza on Unsplash

Word frequency analysis, visualization and sentiment scores using the NLTK toolkit

Last Sunday morning, as I was switching TV channels trying to find something to watch while having breakfast, I stumbled upon a replay of the Senate Hearing on Oversight of AI. It had only been 40 minutes since it started, so I decided to watch the rest of it (talk about an interesting way to spend a Sunday morning!).

When events like the Senate Judiciary Subcommittee Hearing on Oversight of AI take place and you want to catch up on the key takeaways, you have four options: witness it live; look for future recordings (both options would require three hours of your life); read the written version (the transcript), which is about 79 pages long and over 29,000 words; or read reviews on websites or social media to get different opinions and form your own (if it's not just from others).

Nowadays, with everything moving so quickly and our days feeling too short, it's tempting to go for the shortcut and rely on reviews instead of going to the original source (I've been there too). If you choose the shortcut for this hearing, it's highly probable that most reviews you'll find on the web or social media focus on OpenAI CEO Sam Altman's call for regulating AI. However, after watching the hearing, I felt there was more to explore beyond the headlines.

So, after my Sunday funday morning activity, I decided to download the Senate Hearing transcript and use the NLTK package (a Python package for natural language processing, NLP) to analyze it, compare the most used words, apply some sentiment scores across different groups of interest (OpenAI, IBM, Academia, Congress) and see what could be between the lines. Spoiler alert! Out of the 29,000 words analyzed, only 70 (0.24%) were related to words like regulation, regulate, regulatory or legislation.

It's important to note that this article is not about my takeaways from the AI hearing or Mr. ChatGPT Sam Altman. Instead, it focuses on what lies beneath the words of each part of society (private sector, academia, government) represented in this session under the roof of Capitol Hill, and what we can learn from how those words mix with each other.

Considering that the next few months are interesting times for the future of regulation of Artificial Intelligence, as the final draft of the EU AI Act awaits debate in the European Parliament (expected to take place in June), it's worth exploring what's behind the discussions surrounding AI on this side of the Atlantic.

STEP-01: GET THE DATA

I used the transcript published by Justin Hendrix in Tech Policy Press (accessible here).

While Hendrix mentions it's a quick transcript and suggests confirming quotes by watching the Senate Hearing video, I still found it to be quite accurate and interesting for this analysis. If you want to watch the Senate Hearing or read the testimonies of Sam Altman (OpenAI), Christina Montgomery (IBM) and Gary Marcus (Professor at New York University), you can find them here.

Initially, I planned to copy the transcript into a Word document and manually create a table in Excel with the participants' names, their organizations and their comments. However, this approach was time-consuming and inefficient. So, I turned to Python and uploaded the full transcript from a Microsoft Word file into a data frame.
Here is the code I used:

# STEP 01: Read the Word document
# remember to install: pip install python-docx
import docx
import pandas as pd

doc = docx.Document('D:\....your word file on microsoft word')

items = []
names = []
comments = []

# Iterate over paragraphs
for paragraph in doc.paragraphs:
    text = paragraph.text.strip()
    if text.endswith(':'):
        name = text[:-1]
    else:
        items.append(len(items))
        names.append(name)
        comments.append(text)

dfsenate = pd.DataFrame({'item': items, 'name': names, 'comment': comments})

# Remove rows with empty comments
dfsenate = dfsenate[dfsenate['comment'].str.strip().astype(bool)]

# Reset the index
dfsenate.reset_index(drop=True, inplace=True)
dfsenate['item'] = dfsenate.index + 1
print(dfsenate)

The output should look like this:

0  1  Sen. Richard Blumenthal (D-CT)  Now for some introductory remarks.

1  2  Sen. Richard Blumenthal (D-CT)  "Too often we have seen what happens when technology outpaces regulation: the unbridled exploitation of personal data, the proliferation of disinformation, and the deepening of societal inequalities. We have seen how algorithmic biases can perpetuate discrimination and prejudice, and how the lack of transparency can undermine public trust. This is not the future we want."

2  3  Sen. Richard Blumenthal (D-CT)  If you were listening from home, you might have thought that voice was mine and the words from me, but in fact, that voice was not mine. The words were not mine. And the audio was an AI voice cloning software trained on my floor speeches. The remarks were written by ChatGPT when it was asked how I would open this hearing. And you heard just now the result: I asked ChatGPT, why did you pick those themes and that content? And it answered. And I'm quoting: Blumenthal has a strong record in advocating for consumer protection and civil rights. He has been vocal about issues such as data privacy and the potential for discrimination in algorithmic decision making. Therefore, the statement emphasizes these aspects.

3  4  Sen. Richard Blumenthal (D-CT)  Mr. Altman, I appreciate ChatGPT's endorsement. In all seriousness, this apparent reasoning is pretty impressive. I am sure that we'll look back in a decade and view ChatGPT and GPT-4 like we do the first cell phone, those big clunky things that we used to carry around. But we recognize that we are on the verge, really, of a new era. The audio and my playing it may strike you as curious or humorous, but what reverberated in my mind was: what if I had asked it, and what if it had provided, an endorsement of Ukraine surrendering or Vladimir Putin's leadership? That would've been really frightening. And the prospect is more than a little scary, to use the word, Mr. Altman, you have used yourself, and I think you have been very constructive in calling attention to the pitfalls as well as the promise.

4  5  Sen. Richard Blumenthal (D-CT)  And that's the reason why we wanted you to be here today. And we thank you and our other witnesses for joining us. For several months now, the public has been fascinated with GPT, DALL-E and other AI tools. These examples, like the homework done by ChatGPT or the articles and op-eds that it can write, feel like novelties. But the underlying advancements of this era are more than just research experiments. They are no longer fantasies of science fiction. They are real and present: the promises of curing cancer, or developing new understandings of physics and biology, or modeling climate and weather. All very encouraging and hopeful.
But we also know the potential harms, and we've seen them already: weaponized disinformation, housing discrimination, harassment of women and impersonation, fraud, voice cloning, deep fakes. These are the potential risks despite the other rewards. And for me, perhaps the biggest nightmare is the looming new industrial revolution: the displacement of millions of workers, the loss of huge numbers of jobs, the need to prepare for this new industrial revolution in skill training and relocation that may be required. And already industry leaders are calling attention to those challenges.

5  6  Sen. Richard Blumenthal (D-CT)  To quote ChatGPT, this is not necessarily the future that we want. We need to maximize the good over the bad. Congress has a choice now. We had the same choice when we faced social media. We failed to seize that moment. The result is predators on the internet, toxic content exploiting children, creating dangers for them. And Senator Blackburn and I and others like Senator Durbin on the Judiciary Committee are trying to deal with it in the Kids Online Safety Act. But Congress failed to meet the moment on social media. Now we have the obligation to do it on AI before the threats and the risks become real. Sensible safeguards are not in opposition to innovation. Accountability is not a burden, far from it. They are the foundation of how we can move ahead while protecting public trust. They are how we can lead the world in technology and science, but also in promoting our democratic values.

6  7  Sen. Richard Blumenthal (D-CT)  Otherwise, in the absence of that trust, I think we may well lose both. These are sophisticated technologies, but there are basic expectations common in our law. We can start with transparency. AI companies ought to be required to test their systems, disclose known risks, and allow independent researcher access. We can establish scorecards and nutrition labels to encourage competition based on safety and trustworthiness. Limitations on use: there are places where the risk of AI is so extreme that we ought to restrict or even ban their use, especially when it comes to commercial invasions of privacy for profit and decisions that affect people's livelihoods. And of course, accountability, reliability. When AI companies and their clients cause harm, they should be held liable. We should not repeat our past mistakes, for example, Section 230. Forcing companies to think ahead and be responsible for the ramifications of their business decisions can be the most powerful tool of all. Garbage in, garbage out. The principle still applies.
We ought to beware of the garbage, whether it's going into these platforms or coming out of them.

Next, I added some labels for future analysis, identifying each individual by the segment of society they represented:

def assign_sector(name):
    if name in ['Sam Altman', 'Christina Montgomery']:
        return 'Private'
    elif name == 'Gary Marcus':
        return 'Academia'
    else:
        return 'Congress'

# Apply function
dfsenate['sector'] = dfsenate['name'].apply(assign_sector)

# Assign organizations based on names
def assign_organization(name):
    if name == 'Sam Altman':
        return 'OpenAI'
    elif name == 'Christina Montgomery':
        return 'IBM'
    elif name == 'Gary Marcus':
        return 'Academia'
    else:
        return 'Congress'

# Apply function
dfsenate['Organization'] = dfsenate['name'].apply(assign_organization)
print(dfsenate)

Finally, I added a column that counts the words in each statement, which will also help us in further analysis.

dfsenate['WordCount'] = dfsenate['comment'].apply(lambda x: len(x.split()))

At this point, your dataframe should look like this:

     item  name                            ...  Organization  WordCount
0    1     Sen. Richard Blumenthal (D-CT)  ...  Congress      5
1    2     Sen. Richard Blumenthal (D-CT)  ...  Congress      55
2    3     Sen. Richard Blumenthal (D-CT)  ...  Congress      125
3    4     Sen. Richard Blumenthal (D-CT)  ...  Congress      145
4    5     Sen. Richard Blumenthal (D-CT)  ...  Congress      197
..   ...   ...                             ...  ...           ...
399  400   Sen. Cory Booker (D-NJ)         ...  Congress      156
400  401   Sam Altman                      ...  OpenAI        180
401  402   Sen. Cory Booker (D-NJ)         ...  Congress      72
402  403   Sen. Richard Blumenthal (D-CT)  ...  Congress      154
403  404   Sen. Richard Blumenthal (D-CT)  ...  Congress      98

STEP-02: VISUALIZE THE DATA

Let's take a look at the numbers we have so far: 404 questions or testimonies and almost 29,000 words. These numbers give us the material we need to get started. It's important to know that some statements were split into smaller parts: when there were long statements with different paragraphs, the code divided them into separate statements, even though they were actually part of one contribution. To get a better understanding of each participant's involvement, I therefore also considered the number of words they used. This gives another perspective on their engagement.

Hearing on Oversight of AI: Figure 01

As you can see in Figure 01, interventions by members of Congress represented more than half of the hearing, followed by Sam Altman's testimony. However, an alternate view obtained by counting the words from each side shows a more balanced representation between Congress (11 members) and the panel composed of Altman (OpenAI), Montgomery (IBM) and Marcus (Academia).

It's interesting to note the different levels of engagement among the members of Congress who participated in the Senate hearing (view the table below). As expected, Sen. Blumenthal, as the Subcommittee Chair, was highly engaged. But what about the other members? The table shows significant variations in engagement among all eleven participants. Remember, the quantity of contributions doesn't necessarily indicate their quality. I'll let you make your own judgement while you review the numbers.

Lastly, even though Sam Altman received a lot of attention, it's worth noting that Gary Marcus, despite appearing to have few participation opportunities, had a lot to say, as indicated by his word count, which is similar to Altman's. Or is it maybe because academia often provides detailed explanations, while the business world prefers practicality and straightforwardness?

"Alright, professor Marcus, if you could be specific. This is your shot, man.
Talk in plain English and tell me what, if any, rules we ought to implement. And please don't just use concepts. I'm looking for specificity."

Sen. John Kennedy (R-LA). US Senate Hearing on Oversight of AI (2023)

#*****************************PIE CHARTS************************************
import pandas as pd
import matplotlib.pyplot as plt

# Pie chart - Grouping by 'Organization': Questions & Testimonies
org_colors = {'Congress': '#6BB6FF', 'OpenAI': 'green', 'IBM': 'lightblue', 'Academia': 'lightyellow'}
org_counts = dfsenate['Organization'].value_counts()

plt.figure(figsize=(8, 6))
patches, text, autotext = plt.pie(org_counts.values, labels=org_counts.index,
                                  autopct=lambda p: f'{p:.1f}%\n({int(p * sum(org_counts.values) / 100)})',
                                  startangle=90,
                                  colors=[org_colors.get(org, 'gray') for org in org_counts.index])
plt.title('Hearing on Oversight of AI: Questions or Testimonies')
plt.axis('equal')
plt.setp(text, fontsize=12)
plt.setp(autotext, fontsize=12)
plt.show()

# Pie chart - Grouping by 'Organization' (WordCount)
org_wordcount = dfsenate.groupby('Organization')['WordCount'].sum()

plt.figure(figsize=(8, 6))
patches, text, autotext = plt.pie(org_wordcount.values, labels=org_wordcount.index,
                                  autopct=lambda p: f'{p:.1f}%\n({int(p * sum(org_wordcount.values) / 100)})',
                                  startangle=90,
                                  colors=[org_colors.get(org, 'gray') for org in org_wordcount.index])
plt.title('Hearing on Oversight of AI: WordCount')
plt.axis('equal')
plt.setp(text, fontsize=12)
plt.setp(autotext, fontsize=12)
plt.show()

#************Engagement among the members of Congress**********************
# Group by name and count the rows
Summary_Name = dfsenate.groupby('name').agg(comment_count=('comment', 'size')).reset_index()

# WordCount column for each name
Summary_Name['Total_Words'] = dfsenate.groupby('name')['WordCount'].sum().values

# Percentage distribution for comment_count
Summary_Name['comment_count_%'] = Summary_Name['comment_count'] / Summary_Name['comment_count'].sum() * 100

# Percentage distribution for total word count
Summary_Name['Word_count_%'] = Summary_Name['Total_Words'] / Summary_Name['Total_Words'].sum() * 100

Summary_Name = Summary_Name.sort_values('Total_Words', ascending=False)
print(Summary_Name)

+-------+--------------------------------+---------------+-------------+-------------+--------------+
| index | name                           | Interventions | Total_Words | Interv_%    | Word_count_% |
+-------+--------------------------------+---------------+-------------+-------------+--------------+
| 2     | Sam Altman                     | 92            | 6355        | 22.77227723 | 22.32252626  |
| 1     | Gary Marcus                    | 47            | 5105        | 11.63366337 | 17.93178545  |
| 15    | Sen. Richard Blumenthal (D-CT) | 58            | 3283        | 14.35643564 | 11.53184165  |
| 10    | Sen. Josh Hawley (R-MO)        | 25            | 2283        | 6.188118812 | 8.019249008  |
| 0     | Christina Montgomery           | 36            | 2162        | 8.910891089 | 7.594225298  |
| 6     | Sen. Cory Booker (D-NJ)        | 20            | 1688        | 4.95049505  | 5.929256384  |
| 7     | Sen. Dick Durbin (D-IL)        | 8             | 1143        | 1.98019802  | 4.014893393  |
| 11    | Sen. Lindsey Graham (R-SC)     | 32            | 880         | 7.920792079 | 3.091081527  |
| 5     | Sen. Christopher Coons (D-CT)  | 6             | 869         | 1.485148515 | 3.052443008  |
| 12    | Sen. Marsha Blackburn (R-TN)   | 14            | 869         | 3.465346535 | 3.052443008  |
| 4     | Sen. Amy Klobuchar (D-MN)      | 11            | 769         | 2.722772277 | 2.701183744  |
| 13    | Sen. Mazie Hirono (D-HI)       | 7             | 755         | 1.732673267 | 2.652007447  |
| 14    | Sen. Peter Welch (D-VT)        | 11            | 704         | 2.722772277 | 2.472865222  |
| 3     | Sen. Alex Padilla (D-CA)       | 7             | 656         | 1.732673267 | 2.304260775  |
+-------+--------------------------------+---------------+-------------+-------------+--------------+
STEP-03 TOKENIZATION

Here is where the natural language processing (NLP) fun begins. To analyze the text, we'll use the NLTK package in Python, together with a few companion libraries that provide the tools we need for word frequency analysis and visualization.

#pip install nltk
#pip install spacy
#pip install wordcloud
#python -m spacy download en_core_web_sm

First, we'll start with tokenization: breaking the text into individual words, also known as "tokens." For this, we'll use spaCy, an open-source NLP library that can handle contractions, punctuation, and special characters. Next, we'll remove common words that don't add much meaning, like "a," "an," "the," "is," and "and," using the stop word resource from the NLTK library. Finally, we'll apply lemmatization, which reduces words to their base form, known as the lemma. For example, "running" becomes "run" and "happier" becomes "happy." This technique helps us work with the text more effectively and understand its meaning.

To summarize:
o Tokenize the text.
o Remove common words.
o Apply lemmatization.

#***************************WORD FREQUENCY*******************************
import subprocess
import nltk
import spacy
from nltk.probability import FreqDist
from nltk.corpus import stopwords

# Download resources
subprocess.run('python -m spacy download en_core_web_sm', shell=True)
nltk.download('punkt')
nltk.download('stopwords')  # needed for stopwords.words('english') below

# Load spaCy model and set stopwords
nlp = spacy.load('en_core_web_sm')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    words = nltk.word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    words = [word for word in words if word not in stop_words]
    lemmas = [token.lemma_ for token in nlp(" ".join(words))]
    return lemmas

# Aggregate words and create a frequency distribution
all_comments = ' '.join(dfsenate['comment'])
processed_comments = preprocess_text(all_comments)
fdist = FreqDist(processed_comments)

#**********************HEARING TOP 30 COMMON WORDS*********************
import matplotlib.pyplot as plt
import numpy as np

# Most common words and their frequencies
top_words = fdist.most_common(30)
words = [word for word, freq in top_words]
frequencies = [freq for word, freq in top_words]

# Bar plot - Hearing on Oversight of AI: Top 30 Most Common Words
fig, ax = plt.subplots(figsize=(8, 10))
ax.barh(range(len(words)), frequencies, align='center', color='skyblue')
ax.invert_yaxis()
ax.set_xlabel('Frequency', fontsize=12)
ax.set_ylabel('Words', fontsize=12)
ax.set_title('Hearing on Oversight of AI: Top 30 Most Common Words', fontsize=14)
ax.set_yticks(range(len(words)))
ax.set_yticklabels(words, fontsize=10)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_linewidth(0.5)
ax.spines['bottom'].set_linewidth(0.5)
ax.tick_params(axis='x', labelsize=10)
plt.subplots_adjust(left=0.3)
for i, freq in enumerate(frequencies):
    ax.text(freq + 5, i, str(freq), va='center', fontsize=8)
plt.show()

Hearing on Oversight of AI: Figure 02

As you can see in the bar plot (Figure 02), there was a lot of "thinking."
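Before interpreting the chart further, it's worth sanity-checking the preprocess_text pipeline defined above. A minimal sketch with a made-up sentence (the expected output is approximate, since spaCy's lemmatizer can vary slightly between model versions):

# Sketch: sanity-check tokenization, stop-word removal, and lemmatization
sample = "The senators were asking harder questions and running longer hearings."
print(preprocess_text(sample))
# Roughly: ['senator', 'ask', 'hard', 'question', 'run', 'long', 'hearing']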
Maybe the first five words give us an interesting hint of what we should do today and for our future in terms of AI: "We need to think and know where AI should go."

As I mentioned at the beginning of this article, at first sight "regulation" doesn't stand out as a frequently used word in the Senate AI Hearing. However, concluding that it wasn't a main concern would be inaccurate. The interest in whether AI should or should not be regulated was expressed through different words, such as "regulation," "regulate," "agency," and "regulatory." Therefore, let's make some adjustments to the code, aggregate these words, and re-run the analysis to see how this changes the picture.

nlp = spacy.load('en_core_web_sm')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    words = nltk.word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha()]
    words = [word for word in words if word not in stop_words]
    lemmas = [token.lemma_ for token in nlp(" ".join(words))]
    return lemmas

# Aggregate words and create a frequency distribution
all_comments = ' '.join(dfsenate['comment'])
processed_comments = preprocess_text(all_comments)
fdist = FreqDist(processed_comments)
original_fdist = fdist.copy()  # Save the original object

aggregate_words = ['regulation', 'regulate', 'agency', 'regulatory', 'legislation']
aggregate_freq = sum(fdist[word] for word in aggregate_words)
df_aggregatereg = pd.DataFrame({'Word': aggregate_words,
                                'Frequency': [fdist[word] for word in aggregate_words]})

# Remove the individual words and add the aggregation
for word in aggregate_words:
    del fdist[word]
fdist['regulation+agency'] = aggregate_freq

# Pie chart for the regulation+agency distribution
import matplotlib.pyplot as plt

labels = df_aggregatereg['Word']
values = df_aggregatereg['Frequency']

plt.figure(figsize=(8, 6))
plt.subplots_adjust(top=0.8, bottom=0.25)
patches, text, autotext = plt.pie(values, labels=labels,
                                  autopct=lambda p: f'{p:.1f}% ({int(p * sum(values) / 100)})',
                                  startangle=90,
                                  colors=['#6BB6FF', 'green', 'lightblue', 'lightyellow', 'gray'])
plt.title('Regulation+agency Distribution', fontsize=14)
plt.axis('equal')
plt.setp(text, fontsize=8)
plt.setp(autotext, fontsize=8)
plt.show()

Hearing on Oversight of AI: Figure 03

As you can see in Figure 03, the topic of regulation did, after all, come up many times during the Senate AI Hearing.
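The same aggregation trick generalizes to any theme you care about. Here is a sketch; the concept_groups dictionary and its word lists are my own illustration, not from the original analysis, and it reads from the untouched original_fdist saved above.

# Sketch: aggregate frequencies for several hypothetical concept groups at once
concept_groups = {
    'regulation+agency': ['regulation', 'regulate', 'agency', 'regulatory', 'legislation'],
    'risk+safety': ['risk', 'harm', 'safety'],
}
for group, members in concept_groups.items():
    total = sum(original_fdist[word] for word in members)  # FreqDist returns 0 for missing words
    print(f'{group}: {total}')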
STEP-04 WHAT HIDES BEHIND THE WORDS

Words alone may provide us with some clues, but it is the interconnection of words that truly offers us perspective. So, let's take an approach using word clouds to explore whether we can discover insights that simple bar and pie charts cannot show.

# Word cloud - Senate Hearing on Oversight of AI
from wordcloud import WordCloud

wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(fdist)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud - Senate Hearing on Oversight of AI')
plt.show()

Hearing on Oversight of AI: Figure 04

Let's explore further and compare the word clouds for the different groups of interest represented in the AI Hearing (Private, Congress, Academia) and see whether their words reveal different perspectives on the future of AI.

# Word clouds for each group of interest
organizations = dfsenate['Organization'].unique()

for organization in organizations:
    comments = dfsenate[dfsenate['Organization'] == organization]['comment']
    all_comments = ' '.join(comments)
    processed_comments = preprocess_text(all_comments)
    fdist_organization = FreqDist(processed_comments)

    # Word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(fdist_organization)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    if organization == 'IBM':
        plt.title(f'Word Cloud: {organization} - Christina Montgomery')
    elif organization == 'OpenAI':
        plt.title(f'Word Cloud: {organization} - Sam Altman')
    elif organization == 'Academia':
        plt.title(f'Word Cloud: {organization} - Gary Marcus')
    else:
        plt.title(f'Word Cloud: {organization}')
    plt.show()

Hearing on Oversight of AI: Figure 05

It's interesting how some words appear (or disappear) for each group of interest represented in the Senate AI Hearing when they talk about artificial intelligence.

As for the big headline, "Sam Altman's call for regulating AI": whether he is in favor of regulation or not, I really can't tell, but his own words don't seem to contain much "regulation" to me. Instead, Sam Altman seems to take a people-centric approach when he talks about AI, repeating words like "think," "people," "know," "important," and "use," and relying on words like "technology," "system," or "model" rather than the word "AI."

Someone who did have something to say about "risk" and "issues" was Christina Montgomery (IBM), who repeated these words constantly when talking about "technology," "companies," and "AI." An interesting fact in her testimony is finding the words most of us expect to hear from companies involved in developing technology: "trust," "governance," and "think"-ing about what is "right" in terms of AI.

"We need to hold companies responsible today and accountable for AI that they're deploying....."

Christina Montgomery. US Senate Hearing on Oversight of AI (2023)

Gary Marcus said in his initial statement, "I come as a scientist, someone who's founded AI companies, and is someone who genuinely loves AI..." So, for the sake of this NLP analysis, we are considering him the voice of Academia. Words like "need," "think," "know," "go," and "people" stand out among others. An interesting fact is that the word "system" seems to be repeated more than "AI" in his testimony. Maybe AI is not a single, lone technology that will change the future; maybe the impact will come from multiple technologies or systems interacting with each other (IoT, robotics, BioTech, etc.) rather than from any single one of them.
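Word clouds are qualitative by nature. If you want to put numbers behind the comparison above, one option is to tabulate a few probe words per group; the probe list here is my own choice, purely illustrative:

# Sketch: relative frequency (occurrences per 1,000 tokens) of a few probe words per group
probe_words = ['regulation', 'people', 'system', 'risk', 'trust']
for organization in dfsenate['Organization'].unique():
    comments = ' '.join(dfsenate[dfsenate['Organization'] == organization]['comment'])
    tokens = preprocess_text(comments)
    fdist_org = FreqDist(tokens)
    rates = {w: round(1000 * fdist_org[w] / len(tokens), 2) for w in probe_words}
    print(organization, rates)

Normalizing per 1,000 tokens matters here because Congress spoke far more words than any single panelist.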
In the end, the first hypothesis mentioned by Senator John Kennedy seems not entirely false after all (not just for Congress but for society as a whole): we are still at the stage where we are trying to understand the direction AI is heading.

"Permit me to share with you three hypotheses that I would like you to assume for the moment to be true. Hypothesis number one, many members of Congress do not understand artificial intelligence. Hypothesis number two, that absence of understanding may not prevent Congress from plunging in with enthusiasm and trying to regulate this technology in a way that could hurt this technology. Hypothesis number three, that I would like you to assume there is likely a berserk wing of the artificial intelligence community that intentionally or unintentionally could use artificial intelligence to kill all of us and hurt us the entire time that we are dying....."

Sen. John Kennedy (R-LA). US Senate Hearing on Oversight of AI (2023)

STEP-05 THE EMOTION BEHIND YOUR WORDS

We'll use the SentimentIntensityAnalyzer class from the NLTK library for sentiment analysis. This pre-trained model uses a lexicon-based approach, where each word in the lexicon (VADER) has a predefined sentiment polarity value. The sentiment scores of the words in a piece of text are aggregated to calculate an overall sentiment score. The numerical value ranges from -1 (negative sentiment) to +1 (positive sentiment), with 0 indicating neutral sentiment. Positive sentiment reflects a favorable emotion, attitude, or enthusiasm, while negative sentiment conveys an unfavorable emotion or attitude.

#************SENTIMENT ANALYSIS************
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()
dfsenate['Sentiment'] = dfsenate['comment'].apply(lambda x: sid.polarity_scores(x)['compound'])

#************BOXPLOT - GROUP OF INTEREST************
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('white')
plt.figure(figsize=(12, 7))
sns.boxplot(x='Sentiment', y='Organization', data=dfsenate, color='yellow',
            width=0.6, showmeans=True, showfliers=True)

# Customize the axes
def add_cosmetics(title='Sentiment Analysis Distribution by Group of Interest', xlabel='Sentiment'):
    plt.title(title, fontsize=28)
    plt.xlabel(xlabel, fontsize=20)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    sns.despine()

def customize_labels(label):
    if "OpenAI" in label:
        return label + " - Sam Altman"
    elif "IBM" in label:
        return label + " - Christina Montgomery"
    elif "Academia" in label:
        return label + " - Gary Marcus"
    else:
        return label

# Apply customized labels to the y-axis
yticks = plt.yticks()[1]
plt.yticks(ticks=plt.yticks()[0],
           labels=[customize_labels(label.get_text()) for label in yticks])
add_cosmetics()
plt.show()

Hearing on Oversight of AI: Figure 06

A boxplot is always interesting, as it shows the minimum and maximum values, the median, and the first (Q1) and third (Q3) quartiles. In addition, a line of code was added to display the mean value. (Acknowledgment to Elena Kosourova for designing the boxplot code template; I only made adjustments for my dataset.)

Overall, everyone seemed to be in a good mood during the Senate Hearing, especially Sam Altman, who stood out with the highest sentiment score, followed by Christina Montgomery. On the other hand, Gary Marcus seemed to have a more neutral experience (median around 0.25), and he may have felt somewhat uncomfortable at times, with values close to 0 or even negative.
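As a calibration aside before reading more into the distributions: here is what VADER's compound score looks like on a few made-up sentences (a sketch of mine; the exact values depend on the lexicon version):

# Sketch: VADER compound scores on made-up sentences, illustrating the -1..+1 range
print(sid.polarity_scores("I genuinely love this technology.")['compound'])  # positive (> 0)
print(sid.polarity_scores("This could hurt and kill people.")['compound'])   # negative (< 0)
print(sid.polarity_scores("The hearing is on Tuesday.")['compound'])         # neutral (close to 0)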
Congress as a whole displayed a left-skewed distribution in its sentiment scores, indicating a tendency towards neutrality or positivity. Interestingly, if we take a closer look, certain interventions stand out with extremely high or low sentiment scores.

Hearing on Oversight of AI: Figure 07

Maybe we shouldn't interpret the results as if people in the Senate AI Hearing were simply happy or uncomfortable. Perhaps they suggest that those who participated in the Hearing don't hold an overly optimistic view of where AI is headed, but they are not pessimistic either. The scores may indicate that there are concerns, and that participants are being cautious about the direction AI should take.

And what about a timeline? Did the mood during the hearing stay the same throughout? How did the mood of each group of interest evolve? To analyze the timeline, I organized the statements in the order they were captured and conducted a sentiment analysis. Since there are over 400 questions or testimonies, I defined a moving average of the sentiment scores for each group of interest (Congress, Academia, Private), using a window size of 10. This means that the moving average is calculated by averaging the sentiment scores over every 10 consecutive statements.

#**************************TIMELINE: US SENATE AI HEARING**************************************
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import make_interp_spline

# Moving average for each organization
window_size = 10
organizations = dfsenate['Organization'].unique()

# Create the line plot
color_palette = sns.color_palette('Set2', len(organizations))
plt.figure(figsize=(12, 6))
org_frames = {}  # keep each group's frame so the label loop below can reuse it
for i, org in enumerate(organizations):
    df_org = dfsenate[dfsenate['Organization'] == org].copy()  # copy avoids SettingWithCopyWarning

    # Moving average; missing values filled with 0
    df_org['Sentiment'] = df_org['Sentiment'].fillna(0)
    df_org['Moving_Average'] = df_org['Sentiment'].rolling(window=window_size, min_periods=1).mean()
    org_frames[org] = df_org

    # Smooth the moving average with a cubic spline for plotting
    x = np.linspace(df_org.index.min(), df_org.index.max(), 500)
    spl = make_interp_spline(df_org.index, df_org['Moving_Average'], k=3)
    y = spl(x)
    plt.plot(x, y, linewidth=2, label=f'{org} {window_size}-Point Moving Average', color=color_palette[i])

plt.xlabel('Statement Number', fontsize=12)
plt.ylabel('Sentiment Score', fontsize=12)
plt.title('Sentiment Score Evolution during the Hearing on Oversight of AI', fontsize=16)
plt.legend(fontsize=12)
plt.grid(color='lightgray', linestyle='--', linewidth=0.5)
plt.axhline(0, color='black', linewidth=0.5, alpha=0.5)

# Label the final moving-average value for each group
for org in organizations:
    df_org = org_frames[org]
    plt.text(df_org.index[-1], df_org['Moving_Average'].iloc[-1],
             f'{df_org["Moving_Average"].iloc[-1]:.2f}',
             ha='right', va='top', fontsize=12, color='black')

plt.tight_layout()
plt.show()

Hearing on Oversight of AI: Figure 08

At the beginning, the session seemed friendly and optimistic, with everyone discussing the future of AI. But as the session went on, the mood started to change: the members of Congress became less optimistic, and their questions became more challenging. This affected the panelists' scores, with some even getting low scores (you can see this towards the end of the session). Interestingly, Altman was scored by the model as neutral or slightly positive, even during the tense moments with the members of Congress.
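If you are curious which interventions drove those extremes, a quick sketch (mine, not in the original code) surfaces the highest- and lowest-scoring statements:

# Sketch: surface the statements with the most extreme sentiment scores
print(dfsenate.nlargest(3, 'Sentiment')[['item', 'name', 'Sentiment']])
print(dfsenate.nsmallest(3, 'Sentiment')[['item', 'name', 'Sentiment']])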
It's important to remember that the model has its limitations and could border on subjectivity. While sentiment analysis isn't flawless, it offers us an interesting glimpse into the intensity of the emotions that prevailed that day on Capitol Hill.

Final thought

In my opinion, the lesson behind this US Senate AI Hearing lies in the five most repeated words: "We need to think and know where AI should go." It is noteworthy that words like "people" and "importance" were unexpectedly present in Sam Altman's word cloud, going beyond the headline of a "call for regulation." While I hoped to find more words like "transparency," "accountability," "trust," "governance," and "fairness" in Altman's NLP analysis, it was a relief to find some of them frequently repeated in Christina Montgomery's testimony. This is what we all expect to hear more often when AI is on the table.

Gary Marcus emphasized "system" as much as "AI," perhaps inviting us to see artificial intelligence in a broader context. Multiple technologies are emerging right now, and their combined impact on society, work, and employment will come from the clash of these multiple technologies, not just from one of them. Academia plays a vital role in guiding this path and in determining whether some kind of regulation is needed. I say this "literally," not "spiritually" (inside joke from the six-month moratorium letter).

Finally, the word "agency" was repeated as much as "regulation" in its different forms. This suggests that the concept of an "Agency for AI" and its role will likely be a topic of debate in the near future. An interesting reflection on this challenge was offered in the Senate AI Hearing by Sen. Richard Blumenthal:

"...Most of my career has been in enforcement. And I will tell you something, you can create 10 new agencies, but if you don't give them the resources, and I'm talking not just about dollars, I'm talking about scientific expertise, you guys will run circles around 'em. And it isn't just the models or the generative AI that will run circles around them, but it is the scientists in your companies. For every success story in government regulation, you can think of five failures.... And I hope our experience here will be different..."

Sen. Richard Blumenthal (D-CT). US Senate Hearing on Oversight of AI (2023)

Although reconciling innovation, awareness, and regulation is challenging for me, I am all for raising awareness about AI's role in our present and future, while also understanding that "research" and "development" are different things. The first should be encouraged and promoted, not contained; the second is where the extra effort in the "thinking" and "knowing" is needed.

I hope you found this NLP analysis interesting, and I want to thank Justin Hendrix and Tech Policy Press for allowing me to use their transcript in this article. You can access the complete code in this GitHub repository. (Acknowledgement also to ChatGPT for helping me fine-tune some of my code for a better presentation.)

Did I miss anything? Your suggestions are always welcome and keep the conversation going.

Decoding the US Senate Hearing on Oversight of AI: NLP Analysis in Python was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.