I have a dataframe whose df.sentence column contains long sentences. I am trying to extract ARG0 with Semantic Role Labeling and save it in a separate column.
I keep getting this error:
RuntimeError: The size of tensor a (1212) must match the size of tensor b (512) at non-singleton dimension 1
Here is my code:
!pip install allennlp==2.1.0 allennlp-models==2.1.0
from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging
import pandas as pd, csv
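def extract_arg0(sentence):
    # Returns a list of (verb, ARG0 text) pairs, one per verb that has an ARG0.
    result = []
    output = predictor.predict(sentence)
    for verb in output['verbs']:
        desc = verb['description']
        arg0_start = desc.find('ARG0: ')
        if arg0_start > -1:
            arg0_end = arg0_start + len('ARG0: ')
            # Search for the closing bracket after ARG0, not from the start of desc.
            arg0 = desc[arg0_end: desc.find(']', arg0_end)]
            result.append((verb['verb'], arg0))
    return result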
# How to loop over all sentences
from tqdm.notebook import tqdm
tqdm.pandas()
df['Arg0'] = df.sentence.progress_apply(extract_arg0)
I think I should add some code here that skips the failing rows instead of throwing an error, and writes 'failed' to df.Arg0 for them. Is my approach right? If so, any idea how I can add that to my code? If not, any suggestion would be appreciated.
Note: I think the most appropriate approach would be to proceed with Longformer. I looked for an existing approach using Longformer but could not find any. I would also appreciate any recommendations on this.
I also tried with:
!pip install allennlp==2.1.0 allennlp-models==2.9.0
An example of my data:
import pandas as pd
data = {'sentence': ['in addition, our regulatory posture and related expenses have been and will continue to be affected by changes in regulatory expectations for global systemically important financial institutions applicable to, among other things, risk management, liquidity and capital planning and compliance programs, and changes in governmental enforcement approaches to perceived failures to comply with regulatory or legal obligations;•adverse changes in the regulatory ratios that we are required or will be required to meet, whether arising under the dodd-frank act or the basel iii final rule, or due to changes in regulatory positions, practices or regulations in jurisdictions in which we engage in banking activities, including changes in internal or external data, formulae, models, assumptions or other advanced systems used in the calculation of our capital ratios that cause changes in those ratios as they are measured from period to period;•increasing requirements to obtain the prior approval of the federal reserve or our other u.s. and non-u.s. 
regulators for the use, allocation or distribution of our capital or other specific capital actions or programs, including acquisitions, dividends and stock purchases, without which our growth plans, distributions to shareholders, share repurchase programs or other capital initiatives may be restricted;•changes in law or regulation, or the enforcement of law or regulation, that may adversely affect our business activities or those of our clients or our counterparties, and the products or services that we sell, including additional or increased taxes or assessments thereon, capital adequacy requirements, margin requirements and changes that expose us to risks related to the adequacy of our controls or compliance programs;•financial market disruptions or economic recession, whether in the u.s., europe, asia or other regions;•our ability to develop and execute state street beacon, our multi-year program to create cost efficiencies through changes to our operations and to further digitize our service delivery to our clients, any failure of which, in whole or in part, may among other things, reduce our competitive position, diminish the cost-effectiveness of our systems and processes or provide an insufficient return on our associated investment;•our ability to promote a strong culture of risk management, operating controls, compliance oversight and governance that meet our expectations and those of our clients and our regulators;•the results of our review of the manner in which we invoiced certain client expenses, including the amount of expenses determined to be reimbursable, as well as potential consequences of such review including with respect to our client relationships and potential investigations by regulators;•the results of, and costs associated with, governmental or regulatory inquiries and investigations, litigation and similar claims, disputes, or proceedings;•the potential for losses arising from our investments in sponsored investment funds;•the 
possibility that our clients will incur substantial losses in investment pools for which we act as agent, and the possibility of significant reductions in the liquidity or valuation of assets underlying those pools;•our ability to anticipate and manage the level and timing of redemptions and withdrawals from our collateral pools and other collective investment products;•the credit agency ratings of our debt and depository obligations and investor and client perceptions of our financial strength;•adverse publicity, whether specific to state street or regarding other industry participants or industry-wide factors, or other reputational harm;•our ability to control operational risks, data security breach risks and outsourcing risks, our ability to protect our intellectual property rights, the possibility of errors in the quantitative models we use to manage our business and the possibility that our controls will prove insufficient, fail or be circumvented;•our ability to expand our use of technology to enhance the efficiency, accuracy and reliability of our operations and our dependencies on information technology and our ability to control related risks, including cyber-crime and other threats to our information technology infrastructure and systems and their effective operation both independently and with external systems, and complexities and costs of protecting the security of our systems and data;18 •our ability to grow revenue, manage expenses, attract and retain highly skilled people and raise the capital necessary to achieve our business goals and comply with regulatory requirements and expectations;•changes or potential changes to the competitive environment, including changes due to regulatory and technological changes, the effects of industry consolidation and perceptions of state street as a suitable service provider or counterparty;•changes or potential changes in the amount of compensation we receive from clients for our services, and the mix of services 
provided by us that clients choose;•our ability to complete acquisitions, joint ventures and divestitures, including the ability to obtain regulatory approvals, the ability to arrange financing as required and the ability to satisfy closing conditions;•the risks that our acquired businesses and joint ventures will not achieve their anticipated financial and operational benefits or will not be integrated successfully, or that the integration will take longer than anticipated, that expected synergies will not be achieved or unexpected negative synergies or liabilities will be experienced, that client and deposit retention goals will not be met, that other regulatory or operational challenges will be experienced, and that disruptions from the transaction will harm our relationships with our clients, our employees or regulators;•our ability to recognize emerging needs of our clients and to develop products that are responsive to such trends and profitable to us, the performance of and demand for the products and services we offer, and the potential for new products and services to impose additional costs on us and expose us to increased operational risk;•changes in accounting standards and practices; and•changes in tax legislation and in the interpretation of existing tax laws by u.s. and non-u.s. tax authorities that affect the amount of taxes due.actual outcomes and results may differ materially from what is expressed in our forward-looking statements and from our historical financial results due to the factors discussed in this section and elsewhere in this form 10-k or disclosed in our other sec filings.', 'we have many risks to deal with', ' financial risk causes us to lose billions of dollars'],
'second': ['a1', 'a1', 'a3']}
df = pd.DataFrame(data)
You don't show what kind of predictor you're loading, but I suspect that the model can only handle 512 word pieces. That would fit the error message, where a tensor of size 1212 is being matched against one of size 512. Maybe Longformer would be a solution, but then you'd have to train the SRL model with Longformer first.
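If you do want to skip over-long rows rather than crash, as you suggest in the question, a small wrapper around your function would probably do it. This is only a minimal sketch: `safe_extract` and `failed_value` are names I made up, and it assumes the over-long input always surfaces as the `RuntimeError` you posted.

```python
def safe_extract(sentence, extract_fn, failed_value="failed"):
    """Call extract_fn on one sentence; return failed_value if it raises RuntimeError.

    safe_extract and failed_value are hypothetical names, not part of AllenNLP.
    extract_fn is whatever function does the real work, e.g. extract_arg0.
    """
    try:
        return extract_fn(sentence)
    except RuntimeError:
        # Assumption: this is the tensor size mismatch raised for over-long input.
        return failed_value
```

With the code from the question, that would be used as `df['Arg0'] = df.sentence.progress_apply(lambda s: safe_extract(s, extract_arg0))`. Note that it hides every `RuntimeError`, not just the length one, so it is worth checking how many rows end up as 'failed'.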
Think about what you actually want to accomplish though. Your example sentence is actually multiple sentences, with a bulleted list of more sentences at the end. The AllenNLP SRL model was never trained on that kind of input data, and will not perform well on it anyway. I suggest you split the input into sentences and feed in one sentence at a time. That'll be closer to the kind of data the model has seen at training time, so you will get better results from it.
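Splitting first and labeling one sentence at a time could look roughly like the sketch below. The splitter is deliberately naive and is my own assumption, not anything from AllenNLP: it breaks on the "•" bullets in your data and on whitespace after sentence-final punctuation, so it will also split after abbreviations like "u.s.". A proper sentence splitter from an NLP library would likely do better. `split_into_sentences` and `extract_arg0_per_sentence` are hypothetical names.

```python
import re


def split_into_sentences(text):
    """Naive splitter: break on bullet characters and after '.', '!' or '?'.

    Hypothetical helper; it will also split after abbreviations such as 'u.s.'.
    """
    parts = re.split(r"•|(?<=[.!?])\s+", text)
    # Drop empty pieces and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


def extract_arg0_per_sentence(text, extract_fn):
    """Run extract_fn on each split sentence and collect all results in one list.

    extract_fn is expected to return a list per sentence, like extract_arg0 does.
    """
    results = []
    for sentence in split_into_sentences(text):
        results.extend(extract_fn(sentence))
    return results
```

With the code from the question, that would be `df['Arg0'] = df.sentence.progress_apply(lambda s: extract_arg0_per_sentence(s, extract_arg0))`. Individual pieces can still exceed the model's limit, so it may be worth keeping a guard against that as well.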